Website pages to Google Sheets, research-ready notes
You start research with good intentions, then you’re buried in tabs. Copying links into a spreadsheet, grabbing images for reference, pasting chunks of text into “notes” that you’ll never clean up later. It’s slow, and it’s weirdly exhausting.
This website-to-Sheets automation is built for marketers and SEO leads first, but it also saves agency operators who build competitor decks on tight timelines. You get one Google Sheet row per site with links, images, and Markdown-ready page content, so your research stays searchable and reusable.
Below, you’ll see how the workflow crawls a site from the homepage, filters what matters, and appends clean outputs to Google Sheets so your “research” stops being a messy browser session.
How This Automation Works
See how this solves the problem:
n8n Workflow Template: Website pages to Google Sheets, research-ready notes
```mermaid
flowchart LR
subgraph sg0["Manual Flow"]
direction LR
n0@{ icon: "mdi:swap-vertical", form: "rounded", label: "Set Website", pos: "b", h: 48 }
n1@{ icon: "mdi:play-circle", form: "rounded", label: "Manual Trigger", pos: "b", h: 48 }
n2["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Scrape Homepage"]
n3["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/html.dark.svg' width='40' height='40' /></div><br/>Extract Links from HTML"]
n4@{ icon: "mdi:swap-vertical", form: "rounded", label: "Split Links", pos: "b", h: 48 }
n5@{ icon: "mdi:cog", form: "rounded", label: "Remove Duplicate Links", pos: "b", h: 48 }
n6@{ icon: "mdi:swap-horizontal", form: "rounded", label: "Filter Real Hyperlinks", pos: "b", h: 48 }
n7@{ icon: "mdi:swap-horizontal", form: "rounded", label: "Separate Images and Links", pos: "b", h: 48 }
n8@{ icon: "mdi:cog", form: "rounded", label: "Aggregate Images", pos: "b", h: 48 }
n9@{ icon: "mdi:cog", form: "rounded", label: "Aggregate Links", pos: "b", h: 48 }
n10["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Scrape Content Links"]
n11["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/markdown.dark.svg' width='40' height='40' /></div><br/>Convert to Markdown"]
n12@{ icon: "mdi:cog", form: "rounded", label: "Aggregate Scraped Content", pos: "b", h: 48 }
n13@{ icon: "mdi:database", form: "rounded", label: "Add Images to Sheet", pos: "b", h: 48 }
n14@{ icon: "mdi:database", form: "rounded", label: "Add Links to Sheet", pos: "b", h: 48 }
n15@{ icon: "mdi:database", form: "rounded", label: "Add Scraped Content to Sheet", pos: "b", h: 48 }
n0 --> n2
n4 --> n5
n1 --> n0
n9 --> n14
n2 --> n3
n8 --> n13
n11 --> n12
n10 --> n11
n6 --> n7
n5 --> n6
n3 --> n4
n12 --> n15
n7 --> n8
n7 --> n9
n7 --> n10
end
%% Styling
classDef trigger fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
classDef ai fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef aiModel fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
classDef decision fill:#fff8e1,stroke:#f9a825,stroke-width:2px
classDef database fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef code fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef disabled stroke-dasharray: 5 5,opacity: 0.5
class n1 trigger
class n6,n7 decision
class n13,n14,n15 database
class n2,n10 api
classDef customIcon fill:none,stroke:none
class n2,n3,n10,n11 customIcon
```
The Challenge: Turning a Website Into Usable Research Notes
A website looks organized until you try to “capture it.” The homepage links out to product pages, documentation, blogs, case studies, and random campaign landing pages that may or may not still matter. If you collect it manually, you end up with half a list, broken URLs, missing images, and copy-pasted snippets that don’t keep their structure. Then comes the worst part: you don’t trust your notes, so you re-check everything when it’s time to write the audit or brief. That’s wasted time, plus mental load you didn’t plan for.
Here’s where it breaks down in real life.
- You lose serious time on every site just collecting links and keeping them tidy.
- Images get saved “somewhere,” and later you can’t remember which page they came from.
- Copy-pasting content strips formatting, which makes it harder to scan and reuse in docs or AI tools.
- Duplicates and non-HTTP links sneak in, so your sheet looks complete but behaves like junk data.
The Fix: Crawl, Filter, and Append Clean Website Notes to Sheets
This workflow turns a website into a structured, research-ready snapshot inside Google Sheets. You start by defining a target URL (usually the homepage). The automation fetches that homepage HTML, pulls every link it can find, then splits the list into individual URLs so it can clean them properly. It removes duplicates, validates that the links are real web links (not mailto, anchors, or odd formats), and then routes them into two buckets: image assets and actual content pages. For pages, it fetches the HTML, converts it into Markdown so headings and lists stay readable, bundles the results, and appends everything to your Google Sheet.
The flow begins with one website URL and one run. From there, HTTP Request does the fetching, the workflow separates images vs. pages using a Switch, and Google Sheets gets three clean appends (images, links, content) that you can search, filter, and export later.
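If it helps to see the logic before you wire up nodes, here is a rough plain-JavaScript sketch of the same sequence. It assumes Node 18+ for the built-in fetch; the `extractLinks` regex and the commented-out sheet append are illustrative stand-ins, not part of the n8n template.

```javascript
// Rough sketch of the same pipeline in plain Node.js (18+), for orientation only.
// extractLinks() uses a naive regex, and the Google Sheets append is left as a comment
// because the n8n nodes handle that part for you.
const WEBSITE_URL = 'https://example.com';

const extractLinks = (html) =>
  [...html.matchAll(/href="([^"]+)"/g)].map((m) => m[1]);

const isImage = (url) =>
  /^https?:\/\/.*\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\?.*)?$/.test(url);

async function crawl() {
  const homepage = await (await fetch(WEBSITE_URL)).text();

  // Dedupe and keep only real web links, mirroring the dedupe + validation nodes.
  const links = [...new Set(extractLinks(homepage))].filter((l) => l.startsWith('https://'));

  const images = links.filter(isImage);
  const pages = links.filter((l) => !isImage(l));

  // Fetch each content page; Markdown conversion is left to the n8n Markdown node.
  const contents = [];
  for (const page of pages) {
    contents.push(await (await fetch(page)).text());
  }

  console.log({ images: images.length, pages: pages.length, contents: contents.length });
  // A Google Sheets append would write images, links, and content as rows here.
}

crawl().catch(console.error);
```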
What Changes: Before vs. After
| What This Eliminates | Impact You’ll See |
|---|---|
| Manually collecting and tidying links from every page | One deduplicated, validated list of links per site in your sheet |
| Images saved “somewhere” with no record of their source | Image URLs stored in the same row as the website they came from |
| Copy-pasted text that loses its structure | Markdown content that keeps headings and lists readable |
| Duplicate and non-HTTP junk links in your notes | A sheet you can actually trust when it’s time to write the brief |
Real-World Impact
Say you’re doing competitor research for 5 sites in a week. Manually, it’s easy to spend about 10 minutes collecting links, another 10 minutes grabbing images and references, and about 20 minutes copying key page text per site (roughly 40 minutes each). That’s around 3+ hours weekly, and the output is inconsistent. With this workflow, you set the URL, run it, and get links, images, and Markdown content appended to Google Sheets in one go, usually in under 20 minutes per site including waiting on requests. You get most of that time back, and the sheet is actually reusable.
Requirements
- n8n instance (try n8n Cloud free)
- Self-hosting option if you prefer (Hostinger works well)
- Google Sheets for storing links, images, and content
- Google account (OAuth) to allow n8n to edit your sheet
- Target website URL (use a homepage or section hub)
Skill level: Beginner. You’ll connect Google credentials, paste a URL, and map your sheet columns once.
Need help implementing this? Talk to an automation expert (free 15-minute consultation).
The Workflow Flow
You define the website to crawl. A Set node stores your website URL so the rest of the workflow always knows what “home” is. Most people point it at the homepage, but a resources hub works too.
The homepage gets fetched and links are extracted. HTTP Request pulls the raw HTML, then the workflow parses out link targets and explodes them into a list of individual URLs it can evaluate one-by-one.
Links are cleaned, validated, and routed. Duplicates are removed, only HTTPS (or valid web) links are kept, and a Switch separates image assets from content pages so you don’t mix media with text.
Content pages are converted into Markdown and saved. The workflow fetches each page’s HTML, converts it to Markdown for readability, bundles the output, then appends images, links, and content into Google Sheets using dedicated “Add … to Sheet” actions.
You can easily modify the filtering rules to crawl only certain paths (like /blog) based on your needs. See the full implementation guide below for customization options.
Step-by-Step Implementation Guide
Step 1: Configure the Manual Trigger
This workflow starts with a manual run so you can test crawling before scheduling it.
- Add or verify the Manual Launch Start node as the trigger.
- Connect Manual Launch Start to Define Website URL to pass the input into the crawl flow.
Step 2: Connect Google Sheets
These nodes write the crawl results to your Google Sheet.
- Open Append Images to Sheet and set Operation to `appendOrUpdate`, Sheet Name to `your-sheet-name`, and Document ID to `your-document-id`.
- Set Images mapping to `{{ $json.links.join('\n\n') }}` and Website mapping to `{{ $('Define Website URL').item.json.website_url }}`.
- Credential Required: Connect your googleSheetsOAuth2Api credentials in Append Images to Sheet.
- Open Append Links to Sheet and set Operation to `appendOrUpdate`, Sheet Name to `your-sheet-name`, and Document ID to `your-document-id`.
- Set Links mapping to `{{ $json.links.join('\n\n') }}` and Website mapping to `{{ $('Define Website URL').item.json.website_url }}`.
- Credential Required: Connect your googleSheetsOAuth2Api credentials in Append Links to Sheet.
- Open Append Content to Sheet and set Operation to `appendOrUpdate`, Sheet Name to `your-sheet-name`, and Document ID to `your-document-id`.
- Set Website mapping to `{{ $('Define Website URL').item.json.website_url }}` and Scraped Content mapping to `{{ $json.data.join('\n\n').slice(0, 50000) }}` (see the mapping sketch after this list if you're unsure what these expressions produce).
- Credential Required: Connect your googleSheetsOAuth2Api credentials in Append Content to Sheet.
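If you're not sure what those mapping expressions actually write into a cell, here is a small standalone sketch of `join('\n\n')` and `slice(0, 50000)`; the sample `links` and `data` values are invented.

```javascript
// What the sheet mappings evaluate to: links collapse into one newline-separated cell,
// and scraped content gets trimmed so it fits inside a single Google Sheets cell.
// The sample arrays below are made up.
const json = {
  links: ['https://example.com/pricing', 'https://example.com/blog/launch-post'],
  data: ['# Pricing\n\nPlans start at...', '# Launch post\n\nToday we shipped...'],
};

const linksCell = json.links.join('\n\n');                  // one URL per paragraph in the cell
const contentCell = json.data.join('\n\n').slice(0, 50000); // hard cap to respect the cell limit

console.log(linksCell);
console.log(`content length: ${contentCell.length}`);
```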
Step 3: Set Up Website URL and Homepage Fetch
This section sets the target site and pulls the homepage HTML for link extraction.
- In Define Website URL, set website_url to the target site, such as `https://example.com`.
- In Fetch Homepage HTML, set URL to `{{ $json.website_url }}`.
- Keep the Fetch Homepage HTML options at their defaults unless your site needs special headers (the sketch after this list shows what a browser-like header set looks like).
Step 4: Extract, Filter, and Route Links
These nodes extract links, remove duplicates, validate HTTPS, and route images vs pages.
- In Pull Link Targets, keep Operation set to `extractHtmlContent` with CSS Selector `a` and Attribute `href`, returning an array to `links`.
- In Explode Link List, set Field to Split Out to `links` so each link becomes its own item.
- Keep Deduplicate URLs to remove duplicate link entries.
- In Validate HTTPS Links, use the condition `{{ $json.links }}` starts with `https://`.
- In Route Images vs Pages, verify the image regex rule uses `{{ $json.links }}` and the regex `^https?:\/\/.*\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\?.*)?$` (you can sanity-check it with the snippet below).
Route Images vs Pages outputs to both Group Image URLs and the page branch (Group Page URLs → Fetch Page Content) in parallel.
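To sanity-check that routing regex before a run, you can paste it into any JavaScript console with a few sample URLs; the URLs below are made up.

```javascript
// Quick sanity check of the image-routing regex against sample URLs.
const imageRegex = /^https?:\/\/.*\.(?:png|jpe?g|gif|webp|bmp|svg|ico)(?:\?.*)?$/;

const samples = [
  'https://example.com/assets/logo.svg',    // image -> Group Image URLs branch
  'https://example.com/hero.jpg?v=2',       // image, query string still matches
  'https://example.com/blog/how-we-did-it', // page -> Fetch Page Content branch
  'https://example.com/pricing',            // page
];

for (const url of samples) {
  console.log(url, '->', imageRegex.test(url) ? 'image' : 'page');
}
```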
Step 5: Fetch Page Content and Convert to Markdown
This branch crawls non-image pages and bundles their text content.
- In Group Page URLs, keep Field to Aggregate set to `links` to bundle page URLs for sheet output.
- In Fetch Page Content, set URL to `{{ $json.links }}` so each page is requested.
- In HTML to Markdown, set HTML to `{{ $json.data }}` to convert the fetched HTML into Markdown (a standalone conversion sketch follows this list).
- In Bundle Page Text, aggregate the `data` field to prepare content for the sheet.
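The HTML to Markdown node handles the conversion inside n8n. If you ever want the same behavior in a Code node or a one-off script, a minimal sketch with the turndown npm package looks like this; the package choice and sample HTML are assumptions, not part of the template.

```javascript
// Sketch of the HTML -> Markdown step using the turndown package (npm install turndown).
// The n8n node already does this for you; this is only for reproducing the conversion elsewhere.
const TurndownService = require('turndown');

const turndown = new TurndownService({ headingStyle: 'atx' });

const html = '<h1>Pricing</h1><ul><li>Starter</li><li>Pro</li></ul>'; // sample HTML
const markdown = turndown.turndown(html);

console.log(markdown);
// Prints a "# Pricing" heading followed by a bulleted list, ready to drop into a sheet cell.
```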
Step 6: Configure Output Aggregation to Sheets
Image URLs, page links, and content are aggregated and appended into the sheet.
- Verify Group Image URLs aggregates the links field before sending to Append Images to Sheet.
- Verify Group Page URLs connects to Append Links to Sheet for link storage.
- Ensure Bundle Page Text connects to Append Content to Sheet to store the scraped content (the sketch below shows roughly what each aggregate node hands downstream).
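If you're unsure what the aggregate nodes actually pass to the sheet nodes, the items look roughly like this; the field names match the mappings above, and every value is an invented example.

```javascript
// Approximate shape of the items the aggregate nodes hand downstream.
const groupImageUrlsOutput = {
  links: ['https://example.com/assets/logo.svg', 'https://example.com/assets/hero.webp'],
};

const groupPageUrlsOutput = {
  links: ['https://example.com/pricing', 'https://example.com/blog/launch-post'],
};

const bundlePageTextOutput = {
  data: ['# Pricing\n\nPlans start at...', '# Launch post\n\nToday we shipped...'],
};

// Each "Append ... to Sheet" node then joins the relevant array into a single cell.
console.log(groupImageUrlsOutput, groupPageUrlsOutput, bundlePageTextOutput);
```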
Step 7: Test and Activate Your Workflow
Run a manual test to confirm links, images, and content are written to Google Sheets.
- Click Execute Workflow and trigger Manual Launch Start.
- Confirm Append Links to Sheet, Append Images to Sheet, and Append Content to Sheet each add a row for the target Website.
- If results are missing, inspect Validate HTTPS Links and Route Images vs Pages outputs for filtered items.
- When results look correct, set the workflow to Active for production runs.
Watch Out For
- Google Sheets credentials can expire or need specific permissions. If things break, check the credential in n8n and confirm the Google account can edit that spreadsheet.
- Some websites block rapid requests or return different HTML to bots. If the crawl suddenly comes back empty, slow it down with a Wait node after the homepage fetch and rerun.
- Google Sheets cells cap out at 50,000 characters. If you crawl long pages, keep the slice limit in place or split content across multiple rows so you don’t lose data silently (see the chunking sketch below).
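Here is a minimal sketch of that row-splitting idea, assuming you add a Code node before the content append; the chunk size and item shape are illustrative, not part of the original template.

```javascript
// Sketch of splitting long scraped content across several rows instead of slicing it away.
const CELL_LIMIT = 50000; // Google Sheets per-cell character limit

function chunkForSheet(text, limit = CELL_LIMIT) {
  const chunks = [];
  for (let i = 0; i < text.length; i += limit) {
    chunks.push(text.slice(i, i + limit));
  }
  return chunks;
}

const scraped = 'x'.repeat(120000); // stand-in for the joined Markdown content
const rows = chunkForSheet(scraped).map((chunk, index) => ({ part: index + 1, content: chunk }));

console.log(rows.map((row) => ({ part: row.part, length: row.content.length })));
// [ { part: 1, length: 50000 }, { part: 2, length: 50000 }, { part: 3, length: 20000 } ]
```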
Common Questions
How long does this take to set up?
About 30 minutes if your Google account is ready.
Can I build this without coding experience?
Yes, because there’s no code involved. You’ll mainly connect Google Sheets and paste in the website URL you want to crawl.
Is n8n free to use?
Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in Google usage (usually free for normal Sheets use).
Should I use n8n Cloud or self-host?
Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.
Can I customize what gets crawled and saved?
You can. The easiest wins are in the filtering and routing: tweak “Validate HTTPS Links” to include only certain paths (like /blog/), and adjust “Route Images vs Pages” if you want to keep PDFs or exclude file types. If you need more depth, add another HTTP fetch pass after “Pull Link Targets” to crawl a second layer. Common customizations include saving page titles, adding a “Category” column, or splitting long Markdown into multiple rows.
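For example, restricting the crawl to a /blog/ section comes down to a stricter filter condition. A rough sketch of the idea in plain JavaScript, where the URLs are made up and `websiteUrl` stands in for the Define Website URL value:

```javascript
// Sketch of tightening the link filter to a single section such as /blog/.
const websiteUrl = 'https://example.com';
const links = [
  'https://example.com/blog/seo-checklist',
  'https://example.com/pricing',
  'https://cdn.example.com/logo.png',
];

// In Validate HTTPS Links, the roughly equivalent condition is:
// {{ $json.links }} starts with {{ $('Define Website URL').item.json.website_url + '/blog/' }}
const blogOnly = links.filter((link) => link.startsWith(`${websiteUrl}/blog/`));

console.log(blogOnly); // [ 'https://example.com/blog/seo-checklist' ]
```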
Why isn’t anything showing up in my Google Sheet?
Usually it’s an OAuth issue: the credential expired, the wrong Google account is connected, or the account doesn’t have edit access to that spreadsheet. Update the credential in n8n, then double-check the spreadsheet ID and sheet name in each “Append … to Sheet” node. If it still fails, look for a permissions prompt in your Google account and re-authorize.
How many pages or sites can this handle?
It depends more on the target site than on n8n. For most marketing sites, crawling a few dozen internal links per run is fine, but very large sites may hit rate limits or produce content too large for a single Google Sheets cell. If you self-host n8n, execution volume is effectively limited by your server. On n8n Cloud, higher plans support more monthly executions, so you can crawl more sites without babysitting it.
Is this better in n8n than in Zapier or Make?
Often, yes, because this is more than a simple “trigger then write a row.” You’re doing deduplication, validation, routing, aggregation, and conversion to Markdown, which is the kind of multi-step logic that gets awkward (and pricey) in Zapier. n8n also lets you self-host, so you can run lots of crawls without worrying about task counts. Zapier or Make can still be fine for a tiny version, like capturing one URL at a time from a form. If you want help picking the right tool for your exact setup, Talk to an automation expert.
Once this is running, your “research” becomes a repeatable asset instead of a one-off scramble. Set it up, crawl what you need, and move on to the work that actually pays off.
Need Help Setting This Up?
Our automation experts can build and customize this workflow for your specific needs. Free 15-minute consultation—no commitment required.