Bright Data to Google Sheets, clean scrape results
You scrape a page, paste the output somewhere, and then spend the next hour untangling messy HTML, half-missing fields, and random formatting. It’s not the scraping that slows you down. It’s everything after.
Marketing researchers feel this when they’re building lists fast. A data analyst feels it when a “quick pull” turns into manual cleanup. Even a product lead doing competitive checks gets dragged into it. This Bright Data-to-Google Sheets automation fixes the boring part, so your spreadsheet is usable the moment it fills in.
This workflow scrapes with Bright Data via an AI agent, normalizes the results, and lands clean, structured rows in Google Sheets. You’ll see what it removes, what you get back, and what you need to run it.
How This Automation Works
See how this solves the problem:
n8n Workflow Template: Bright Data to Google Sheets, clean scrape results
flowchart LR
subgraph sg0["When clicking ‘Test workflow’ Flow"]
direction LR
n0@{ icon: "mdi:robot", form: "rounded", label: "AI Agent", pos: "b", h: 48 }
n1@{ icon: "mdi:play-circle", form: "rounded", label: "When clicking ‘Test workflow’", pos: "b", h: 48 }
n2@{ icon: "mdi:cog", form: "rounded", label: "MCP Client list all tools fo..", pos: "b", h: 48 }
n3@{ icon: "mdi:cog", form: "rounded", label: "MCP Client List all tools", pos: "b", h: 48 }
n4@{ icon: "mdi:cog", form: "rounded", label: "MCP Client Bright Data Web S..", pos: "b", h: 48 }
n5["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Webhook for web scraper"]
n6@{ icon: "mdi:swap-vertical", form: "rounded", label: "Set the URLs", pos: "b", h: 48 }
n7@{ icon: "mdi:cog", form: "rounded", label: "MCP Client to Scrape as Mark..", pos: "b", h: 48 }
n8@{ icon: "mdi:cog", form: "rounded", label: "MCP Client to Scrape as HTML", pos: "b", h: 48 }
n9@{ icon: "mdi:brain", form: "rounded", label: "Google Gemini Chat Model for..", pos: "b", h: 48 }
n10@{ icon: "mdi:memory", form: "rounded", label: "Simple Memory", pos: "b", h: 48 }
n11["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Webhook for Web Scraper AI A.."]
n12@{ icon: "mdi:swap-vertical", form: "rounded", label: "Set the URL with the Webhook..", pos: "b", h: 48 }
n13@{ icon: "mdi:code-braces", form: "rounded", label: "Create a binary data", pos: "b", h: 48 }
n14@{ icon: "mdi:cog", form: "rounded", label: "Write the scraped content to..", pos: "b", h: 48 }
n0 --> n11
n0 --> n13
n6 --> n4
n10 -.-> n0
n13 --> n14
n3 -.-> n0
n8 -.-> n0
n7 -.-> n0
n1 --> n2
n1 --> n12
n4 --> n5
n9 -.-> n0
n2 --> n6
n12 --> n0
end
%% Styling
classDef trigger fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
classDef ai fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef aiModel fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
classDef decision fill:#fff8e1,stroke:#f9a825,stroke-width:2px
classDef database fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef code fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef disabled stroke-dasharray: 5 5,opacity: 0.5
class n1 trigger
class n0 ai
class n9 aiModel
class n10 ai
class n5,n11 api
class n13 code
The Challenge: Clean web data never arrives clean
Web scraping sounds simple until you try to use the output. One page returns neat labels, the next page hides the same info inside nested elements, and suddenly your “dataset” is 40 lines of markup per item. Then comes the second job: turning that raw scrape into something a spreadsheet can actually work with. You copy-paste, break columns, re-run scrapes because a field was missed, and try to remember which version is the “final” one. Honestly, that mental overhead is what makes people stop doing research consistently.
It adds up fast. Here’s where it usually breaks down in real teams:
- You end up cleaning HTML and markdown by hand, which is slow and easy to mess up.
- A single missing field forces re-scraping because the source is not standardized across pages.
- Results land in files or chat messages, so the “real list” lives in five places at once.
- As soon as volume increases, quality drops because nobody has time to validate every row.
The Fix: Bright Data scrape output that lands as clean rows
This workflow starts with a set of target URLs, then uses Bright Data’s MCP Server to scrape each page in both markdown and HTML formats. An autonomous AI agent (backed by a chat model) decides which scraping tool to use and how to interpret the response so you don’t have to babysit selectors. Next, the workflow restructures the raw output into predictable fields, builds a clean payload, and writes results to disk for traceability. At the same time, it sends a webhook notification so other systems can react (or you can just see that it worked). Finally, the cleaned dataset is ready to be logged into Google Sheets (and you can mirror it to Excel if your team lives in Microsoft 365).
The workflow kicks off, fetches the available Bright Data tools, assigns your target links, and runs the scrape. From there, the AI agent formats the response into something consistent, then the workflow packages the output for storage and reporting. You end up with reliable, spreadsheet-friendly data instead of a pile of raw page content.
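To make “predictable fields” concrete: if you prompt the agent to reply with a JSON object, a small Code node can flatten that reply into the columns a Google Sheets append step expects. The sketch below is a hypothetical example, not part of the template, and the field names (title, price, summary) are assumptions you would swap for whatever you actually extract.

```javascript
// Hypothetical "Normalize for Sheets" Code node (not in the template).
// Assumes the agent was prompted to reply with a JSON object; falls back
// to passing the raw text through if parsing fails.
const results = [];

for (const item of $input.all()) {
  let parsed = {};
  try {
    parsed =
      typeof item.json.output === 'string'
        ? JSON.parse(item.json.output)
        : item.json.output || {};
  } catch (err) {
    parsed = { summary: item.json.output }; // keep the raw text rather than dropping the row
  }

  results.push({
    json: {
      source_url: parsed.url || item.json.url || '',
      title: parsed.title || '',
      price: parsed.price || '',
      summary: parsed.summary || '',
      scraped_at: new Date().toISOString(),
    },
  });
}

return results;
```

Feed those items into a Google Sheets Append Row node and each scraped page becomes one tidy row.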
What Changes: Before vs. After
| What This Eliminates | Impact You’ll See |
|---|---|
| Hand-cleaning HTML/markdown and re-scraping pages for missed fields | Scrape output lands as consistent, spreadsheet-ready rows |
| Results scattered across files, chat threads, and one-off exports | One clean dataset you can log straight into Google Sheets |
Real-World Impact
Say you’re collecting competitive info from 30 URLs each week. Manually, you might spend about 5 minutes per page scraping, then another 5 minutes cleaning and pasting into Google Sheets, which is roughly 5 hours of low-value work. With this workflow, you drop the URLs in once, wait for the scrape and AI formatting to finish (often around 20 minutes total), and the output is ready to log as clean rows. That’s basically an afternoon returned to you every week.
Requirements
- n8n instance (try n8n Cloud free)
- Self-hosting option if you prefer (Hostinger works well)
- Bright Data to run scraping via Web Unlocker.
- Google Sheets to store and share clean rows.
- Google Gemini API key (get it from Google AI Studio or Vertex AI).
Skill level: Intermediate. You’ll connect credentials, install a community node (self-hosted), and tweak a few fields like URLs and webhook endpoints.
Need help implementing this? Talk to an automation expert (free 15-minute consultation).
The Workflow Flow
A manual run (or your own trigger) starts the job. In the template it begins with a manual execution, but you can swap that for a webhook, a form submission, or a scheduled run when you want fresh data.
Bright Data tools are discovered, then your target links are assigned. The workflow pulls the MCP tool catalog and maps the URLs you want scraped, so the agent has the right “actions” available for the job.
The scrape runs, then an AI agent structures the output. Bright Data returns page content (markdown and HTML). The Gemini-backed agent interprets it, keeps context in memory, and reshapes it into consistent fields you can actually use downstream.
Outputs are packaged and sent where you need them. The workflow writes a file to disk for record-keeping, and it can dispatch results via webhook so Google Sheets (or another system) receives clean, predictable data.
You can easily modify the input method (manual run vs. webhook vs. form) to match how your team collects URLs. See the full implementation guide below for customization options.
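For example, if you replace the manual start with a Webhook node, the call from your own script or internal tool could look roughly like this. The host and the /scrape-urls path are placeholders for whatever your Webhook node exposes, not values from the template.

```javascript
// Hypothetical trigger call after swapping the manual start for a Webhook node.
// Adjust the host and path to match your n8n instance (Node 18+ has fetch built in).
fetch('https://your-n8n-host/webhook/scrape-urls', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://about.google/',
    webhook_url: 'https://webhook.site/[YOUR_ID]',
    format: 'scrape_as_markdown',
  }),
}).then((res) => console.log('workflow accepted:', res.status));
```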
Step-by-Step Implementation Guide
Step 1: Configure the Manual Trigger
Start the workflow with a manual trigger so you can test and iterate on scraping behavior before scheduling or external triggers.
- Add the Manual Execution Start node as the trigger.
- Connect Manual Execution Start to both Fetch MCP Tool Catalog and Prepare URL and Format, and confirm both branches run in parallel.
Step 2: Connect MCP Tools and Target URLs
Load MCP tools, then define target URLs and webhook destinations used by the scraping flow.
- Open Fetch MCP Tool Catalog and connect credentials. Credential Required: Connect your mcpClientApi credentials.
- In Assign Target Links, set url to `https://about.google/` and webhook_url to `https://webhook.site/[YOUR_ID]`. If you want to feed several URLs per run, see the sketch after this list.
- Confirm the flow: Fetch MCP Tool Catalog → Assign Target Links → Run Bright Data Scrape.

Note: Replace `https://webhook.site/[YOUR_ID]` with a real webhook URL before testing to capture responses.
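The template scrapes a single url per run. As a hedged option (not part of the template), you could add a Code node right after Assign Target Links that expands a comma-separated url field into one item per page, so Run Bright Data Scrape and the webhook dispatch execute once per URL.

```javascript
// Hypothetical "Split URL list" Code node (not part of the template):
// expands a comma-separated url field into one item per URL so the
// scrape, webhook dispatch, and sheet logging run once per page.
const source = $input.first().json;

const urls = String(source.url || '')
  .split(',')
  .map((u) => u.trim())
  .filter(Boolean);

return urls.map((url) => ({
  json: {
    url,
    webhook_url: source.webhook_url,
    format: source.format || 'scrape_as_markdown',
  },
}));
```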
Step 3: Prepare the AI Agent Inputs and Memory
Define the URL, webhook destination, and format inputs that the AI agent uses to orchestrate scraping.
- In Prepare URL and Format, set url to `https://about.google/`, webhook_url to `https://webhook.site/[YOUR_ID]`, and format to `scrape_as_markdown`.
- In Buffer Memory Window, set Session Key to `=Perform the web scraping for the below URL {{ $json.url }}` and Context Window Length to `10`.
- Connect Buffer Memory Window to Autonomous Scrape Agent via the AI memory connection.
Step 4: Configure the AI Agent and MCP Tooling
Set up the AI agent, its language model, and the MCP tools it can invoke for scraping.
- In Autonomous Scrape Agent, set the Text prompt to `=Scrape the web data as per the provided URL: {{ $json.url }} using the format as {{ $json.format }}`.
- In Gemini Chat Model, select Model `models/gemini-2.0-flash-exp` and connect credentials. Credential Required: Connect your googlePalmApi credentials.
- Connect Gemini Chat Model to Autonomous Scrape Agent as the language model.
- Ensure Expose MCP Tools, MCP Markdown Scraper, and MCP HTML Scraper are connected to Autonomous Scrape Agent as AI tools. Credential Required: Connect your mcpClientApi credentials.
Step 5: Configure Direct Scraping and Webhook Dispatch
The workflow runs a direct MCP scrape in addition to the AI agent. This path posts scrape results to a webhook.
- In Run Bright Data Scrape, set Tool Name to `=scrape_as_markdown`, Operation to `executeTool`, and Tool Parameters to `={ "url": "{{ $json.url }}" }`. Credential Required: Connect your mcpClientApi credentials.
- In Webhook Dispatch for Scrape, set URL to `=https://webhook.site/[YOUR_ID]` and enable Send Body.
- Set the body parameter response to `={{ $json.result.content[0].text }}` so the scraped content is delivered (see the sketch after this list for what that expression points at).
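That response expression assumes the MCP tool result follows the usual shape of a content array holding a text entry, roughly as sketched below. This is an illustration, not output copied from the template, so verify it against your own execution data before relying on it.

```javascript
// Rough shape of a Run Bright Data Scrape item, based on the standard MCP
// tool-result format (verify against your own execution output).
const exampleItem = {
  json: {
    result: {
      content: [
        { type: 'text', text: '# Page title\n\nScraped markdown body…' },
      ],
    },
  },
};

// This is the path the webhook body expression walks:
// {{ $json.result.content[0].text }}
console.log(exampleItem.json.result.content[0].text);
```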
Step 6: Configure Agent Output Delivery and File Storage
Send agent outputs to a webhook and store the full payload locally as a JSON file.
- In Webhook for Agent Output, set URL to `={{ $('Prepare URL and Format').item.json.webhook_url }}` and enable Send Body.
- Set the body parameter response to `={{ $json.output }}` to send the agent’s output.
- In Build Binary Payload, keep the Function Code as provided to convert the JSON to a binary buffer (a rough sketch of what that step does follows this list).
- In Write Scrape File, set Operation to `write` and File Name to `d:\Scraped-Content.json`.
- Confirm the parallel branch behavior: Autonomous Scrape Agent outputs to both Webhook for Agent Output and Build Binary Payload in parallel.
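For orientation only, here is a minimal sketch of what a Function node like Build Binary Payload typically does; the template ships its own code, so keep that version and treat this as a reference, not a replacement.

```javascript
// Minimal sketch of a "Build Binary Payload" Function node (illustrative only;
// the template includes its own code). It serializes the agent output and
// attaches it as base64 binary so Write Scrape File can save it to disk.
const jsonString = JSON.stringify(items[0].json, null, 2);

items[0].binary = {
  data: {
    data: Buffer.from(jsonString, 'utf8').toString('base64'),
    mimeType: 'application/json',
    fileName: 'Scraped-Content.json',
  },
};

return items;
```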
Note: Make sure the n8n process can write to d:\ or change the path to a valid directory for your environment.

Step 7: Test & Activate Your Workflow
Run a manual test to confirm the agent, scraping tools, webhook dispatch, and file writing all succeed.
- Click Execute Workflow from Manual Execution Start to run the entire flow.
- Verify that your webhook receives data from both Webhook Dispatch for Scrape and Webhook for Agent Output.
- Check the file system for `d:\Scraped-Content.json` created by Write Scrape File.
- If all outputs look correct, switch the workflow to Active for production use.
Watch Out For
- Bright Data credentials and zone settings matter. If the scrape fails, check your Bright Data API token and confirm the Web Unlocker zone (often named mcp_unlocker) is active in the Bright Data control panel.
- If you’re using Wait nodes or external rendering, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
- Default prompts in AI nodes are generic. Add your brand voice early or you’ll be editing outputs forever.
Common Questions
How long does setup take?
Plan on about an hour if your Bright Data and Gemini accounts are ready.
Can a non-technical team set this up?
Yes, but you will want one person who’s comfortable with API keys and connecting accounts. The workflow logic is already built; most of the work is setup and testing on a few URLs.
Is this free to run?
Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in Bright Data usage and Gemini API costs.
Should I use n8n Cloud or self-host?
Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.
Can I customize the inputs and outputs?
You can swap the input without changing the core scrape logic. For example, replace the Manual Execution Start with a webhook or a Jotform Trigger, then map incoming URLs into the “Prepare URL and Format” and “Assign Target Links” nodes. Many teams also customize the AI agent instructions (Gemini Chat Model) to extract different fields, and change the webhook dispatch node to send results to Slack, Airtable, or a CRM instead of a sheet.
Why does the scrape fail with an MCP Client error?
Usually it’s an invalid or missing API token inside the MCP Client environment settings.
How much scraping volume can this handle?
On n8n Cloud, capacity depends on your plan’s monthly executions, and higher-volume plans handle bigger scraping schedules. If you self-host, there’s no platform execution cap, but your server resources and Bright Data limits still apply. Practically, this workflow is comfortable running small batches all day, then scaling up once you’ve validated the fields you care about. If you’re scraping hundreds of URLs daily, you’ll want batching and error handling tuned so retries don’t flood your webhook outputs.
Why use n8n instead of Zapier or Make?
For this workflow, n8n has a few advantages: more complex logic with unlimited branching at no extra cost, a self-hosting option for unlimited executions, and native HTTP + file handling that many Zapier-style flows make awkward or expensive. The other big factor is the community MCP Client node, which is not a typical “plug and play” connector in Zapier. Zapier or Make can still be fine if you only need to capture a couple of fields from stable pages and push them to Sheets. Once pages get dynamic and you want an agent to choose tools and formats, n8n is simply a better fit. Talk to an automation expert if you’re not sure which fits.
When your scrape output lands as clean spreadsheet rows, research stops being a one-off chore. Set it up once, then let the workflow keep your data usable.
Need Help Setting This Up?
Our automation experts can build and customize this workflow for your specific needs. Free 15-minute consultation—no commitment required.