Google Sheets + OpenAI: crawl sites, audit faster
You open a site audit and immediately drown in tabs. Home, services, blog categories, random landing pages, then another tab for headings, another for links, another for notes you swear you’ll organize later.
This is where site crawl automation helps. SEO analysts usually feel the pain first, but content leads and agency account managers get stuck in the same loop: find page, copy headings, count links, summarize, repeat.
This workflow crawls a site, stores the useful bits in Google Sheets, then lets you ask questions like a chatbot using that saved data. You’ll see what it does, what you need, and where teams typically save a few hours per audit.
How This Automation Works
See how this solves the problem:
n8n Workflow Template: Google Sheets + OpenAI: crawl sites, audit faster
flowchart LR
subgraph sg0["Chat web Flow"]
direction LR
n0@{ icon: "mdi:robot", form: "rounded", label: "AI Agent", pos: "b", h: 48 }
n1@{ icon: "mdi:brain", form: "rounded", label: "OpenAI Chat Model", pos: "b", h: 48 }
n2@{ icon: "mdi:memory", form: "rounded", label: "Simple Memory", pos: "b", h: 48 }
n3@{ icon: "mdi:database", form: "rounded", label: "Get row(s) in sheet in Googl..", pos: "b", h: 48 }
n4@{ icon: "mdi:swap-horizontal", form: "rounded", label: "If", pos: "b", h: 48 }
n5["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Maping Sitemap"]
n6@{ icon: "mdi:cog", form: "rounded", label: "XML1", pos: "b", h: 48 }
n7["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>UA Rotativo1"]
n8@{ icon: "mdi:location-exit", form: "rounded", label: "Req Error", pos: "b", h: 48 }
n9@{ icon: "mdi:location-exit", form: "rounded", label: "Sitemap Error", pos: "b", h: 48 }
n10["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Req robots"]
n11["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>extract sitemap url"]
n12@{ icon: "mdi:swap-vertical", form: "rounded", label: "OPTIONS", pos: "b", h: 48 }
n13@{ icon: "mdi:robot", form: "rounded", label: "AI Agent1", pos: "b", h: 48 }
n14@{ icon: "mdi:brain", form: "rounded", label: "OpenAI Chat Model1", pos: "b", h: 48 }
n15@{ icon: "mdi:play-circle", form: "rounded", label: "Chat web", pos: "b", h: 48 }
n16@{ icon: "mdi:robot", form: "rounded", label: "Structured Output Parser", pos: "b", h: 48 }
n17@{ icon: "mdi:swap-horizontal", form: "rounded", label: "If1", pos: "b", h: 48 }
n18@{ icon: "mdi:robot", form: "rounded", label: "Respond to Chat", pos: "b", h: 48 }
n19@{ icon: "mdi:robot", form: "rounded", label: "Respond to Chat1", pos: "b", h: 48 }
n20@{ icon: "mdi:robot", form: "rounded", label: "Message a model", pos: "b", h: 48 }
n21@{ icon: "mdi:cog", form: "rounded", label: "XML", pos: "b", h: 48 }
n22@{ icon: "mdi:swap-vertical", form: "rounded", label: "Loop Over Items", pos: "b", h: 48 }
n23@{ icon: "mdi:robot", form: "rounded", label: "Message a model1", pos: "b", h: 48 }
n24@{ icon: "mdi:database", form: "rounded", label: "Append row in sheet in Googl..", pos: "b", h: 48 }
n25@{ icon: "mdi:database", form: "rounded", label: "Complete", pos: "b", h: 48 }
n26@{ icon: "mdi:web", form: "rounded", label: "HTTP Request2", pos: "b", h: 48 }
n27["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Merge"]
n28@{ icon: "mdi:swap-vertical", form: "rounded", label: "Split URLs", pos: "b", h: 48 }
n29["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Req URL"]
n30["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/markdown.dark.svg' width='40' height='40' /></div><br/>HTML to Markdown"]
n31["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Maping Sitemaps"]
n32@{ icon: "mdi:database", form: "rounded", label: "Get data schema", pos: "b", h: 48 }
n4 --> n0
n4 --> n13
n17 --> n12
n17 --> n18
n21 --> n27
n6 --> n20
n27 --> n28
n12 --> n7
n29 --> n30
n0 --> n19
n15 --> n32
n13 --> n17
n10 --> n11
n10 --> n8
n28 --> n22
n7 --> n10
n26 -.-> n0
n2 -.-> n0
n5 --> n6
n5 --> n9
n32 --> n4
n22 --> n25
n22 --> n29
n31 --> n21
n20 --> n31
n30 --> n23
n23 --> n22
n1 -.-> n0
n14 -.-> n13
n11 --> n5
n16 -.-> n13
n24 -.-> n23
n3 -.-> n0
end
%% Styling
classDef trigger fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
classDef ai fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef aiModel fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
classDef decision fill:#fff8e1,stroke:#f9a825,stroke-width:2px
classDef database fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef code fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef disabled stroke-dasharray: 5 5,opacity: 0.5
class n15 trigger
class n0,n13,n16,n18,n19,n20,n23 ai
class n1,n14 aiModel
class n2 ai
class n4,n17 decision
class n3,n24,n25,n32 database
class n5,n10,n26,n29,n31 api
class n7,n11,n27 code
classDef customIcon fill:none,stroke:none
class n5,n7,n10,n11,n27,n29,n30,n31 customIcon
The Challenge: Website audits turn into tab chaos
A “quick” website audit rarely stays quick. You start with one question (“what’s on the services pages?”), then you’re manually checking headings, scanning navigation paths, and pulling links into a doc that nobody ever reuses. The mental load is sneaky too. You’re constantly switching context, which makes you miss obvious issues like duplicate H1s, thin pages hiding in a sitemap, or outdated CTAs. And if you’re doing this for multiple client sites, it becomes the same repetitive grind every week.
It adds up fast. Here’s where it breaks down in real life.
- You end up copying headings and links by hand, and the moment you paste them into a doc they go stale.
- Sitemaps are inconsistent across sites, so you waste time hunting for the “right” URLs to review.
- Audits get stuck at “observations” because summarizing every important page takes forever.
- When someone asks a follow-up question a week later, you have to redo the crawl because your notes aren’t structured.
The Fix: Crawl once, store in Sheets, then ask questions
This workflow starts as a web consultation chatbot, but the real win is what it remembers. The first time you give it a website URL, it validates the URL with AI, finds the sitemap through robots.txt, and chooses the most relevant sitemap (pages, posts, categories, tags) based on your scan options. Then it crawls the selected URLs in batches, downloads each page’s HTML, converts it to Markdown, and uses OpenAI to extract the language, heading hierarchy, internal links, external links, and a clean summary. Each page becomes a structured row in Google Sheets, so you can filter, sort, and reuse it later. Once indexing is done, the sheet gets flagged as ready, which flips the workflow into “agent mode” for future questions.
After that first crawl, you don’t re-audit from scratch. You ask questions in chat, and the AI agent reads your Google Sheets “memory” to answer quickly. If something needs fresh info, it can also make real-time HTTP requests so you’re not stuck trusting old snapshots.
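To make the stored structure concrete, here is a rough sketch of the kind of row each page becomes. In the actual workflow, OpenAI extracts these fields from the page's Markdown; this deterministic, regex-based stand-in (`pageToRow` is a hypothetical helper, not a node in the template) only illustrates the row shape:

```javascript
// Illustrative only: the real workflow has OpenAI extract these fields.
// This regex-based approximation just shows the shape of one sheet row.
function pageToRow(url, html) {
  const headings = [...html.matchAll(/<(h[1-6])[^>]*>(.*?)<\/\1>/gi)]
    .map(m => `${m[1].toUpperCase()}: ${m[2].replace(/<[^>]+>/g, '').trim()}`);
  const links = [...html.matchAll(/<a\s[^>]*href="([^"]+)"/gi)].map(m => m[1]);
  const origin = new URL(url).origin;
  const isInternal = l => l.startsWith('/') || l.startsWith(origin);
  return {
    'Page URL': url,
    'Headings': headings.join(' | '),
    'Internal Links': links.filter(isInternal).join(' | '),
    'External Links': links.filter(l => l.startsWith('http') && !isInternal(l)).join(' | '),
  };
}

pageToRow('https://example.com/services',
  '<h1>Services</h1><a href="/contact">Contact</a><a href="https://twitter.com/x">X</a>');
// → Headings: 'H1: Services', Internal Links: '/contact', External Links: 'https://twitter.com/x'
```

The point of the flat row is that Sheets filters and the chat agent can both query it without re-crawling anything.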
What Changes: Before vs. After
| What This Eliminates | Impact You’ll See |
|---|---|
| Copying headings and links by hand into docs that go stale | Structured, filterable rows in Google Sheets you can reuse |
| Re-crawling a site every time someone asks a follow-up question | The chat agent answers from stored data in seconds |
| Summarizing every important page yourself | AI-generated summaries, language, and link data for each page |
Real-World Impact
Say you’re auditing a 50-page marketing site. Manually, you might spend about 5 minutes per page to open it, note the H1/H2s, skim for the gist, and grab a couple of key links, which is roughly 4 hours of pure busywork. With this workflow, you submit the URL once, let it crawl in the background, and you’re left with a Google Sheet containing headings, links, and summaries for every page. From there, answering “show me all H1s on services pages” takes minutes, not another afternoon.
Requirements
- n8n instance (try n8n Cloud free)
- Self-hosting option if you prefer (Hostinger works well)
- Google Sheets to store crawl results and “memory”.
- OpenAI to extract headings, links, and summaries.
- OpenAI API key (get it from your OpenAI dashboard).
Skill level: Intermediate. You’ll connect accounts, paste API keys, and tweak scan options, but you won’t be writing real code unless you want to.
Need help implementing this? Talk to an automation expert (free 15-minute consultation).
The Workflow Flow
A chat message triggers the workflow. You enter a website URL in the chat interface, and the workflow first checks Google Sheets to see if that site has already been indexed.
It validates the URL and discovers the sitemap. If the site is new, an AI validator confirms the URL is usable, then an HTTP request pulls robots.txt to locate sitemap URLs. Another AI step selects the most relevant sitemap based on your scan options (for example, pages vs. posts).
The crawler processes pages in batches. Each URL is fetched with an HTTP request using a rotating user-agent, the HTML is converted to Markdown, and OpenAI analyzes the content to extract language, heading structure, internal links, external links, and a summary. Those results are appended into Google Sheets row-by-row.
Then it switches into “agent mode” for questions. Once the sheet is flagged as indexed, the LangChain-style agent reads your stored rows to answer questions like “what’s on the contact page?” It can also do live HTTP checks when you need up-to-date info.
You can easily modify which URL types get scanned (pages, posts, categories) to match your audit style. See the full implementation guide below for customization options.
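The page-type filtering described above can be sketched without the AI step. This is a hypothetical, deterministic stand-in for the Sitemap Selector (`pickSitemap` is an illustrative name, not a node in the template) showing how scan options could map to a sitemap-index entry:

```javascript
// Assumed sketch: pick the first sitemap from a sitemap index whose URL
// matches an enabled scan option; fall back to the first entry.
function pickSitemap(sitemapLocs, options) {
  const wanted = [];
  if (options.scan_pages) wanted.push('page');
  if (options.scan_posts) wanted.push('post');
  if (options.category) wanted.push('categor');
  if (options.tags) wanted.push('tag');
  return sitemapLocs.find(loc => wanted.some(w => loc.includes(w))) || sitemapLocs[0];
}

pickSitemap(
  ['https://example.com/post-sitemap.xml', 'https://example.com/page-sitemap.xml'],
  { scan_pages: true, scan_posts: false, category: false, tags: false }
);
// → 'https://example.com/page-sitemap.xml'
```

The template uses an AI step for this choice instead, which handles messier sitemap naming than simple substring matching would.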
Step-by-Step Implementation Guide
Step 1: Configure the Web Chat Trigger
Set up the inbound chat endpoint that starts the workflow and passes user input into the AI validation and conversation paths.
- Add and open Web Chat Trigger.
- Set Public to `true` and Authentication to `basicAuth`.
- Credential Required: Connect your httpBasicAuth credentials.
- Confirm that Web Chat Trigger outputs to Retrieve Data Schema as the first step in the flow.
Step 2: Connect Google Sheets
These nodes read and write the website content database used by the assistant and by the summarization process.
- Open Retrieve Data Schema and select the spreadsheet in Document ID and Sheet Name (e.g., `gid=0`).
- In Retrieve Data Schema, ensure the filter uses Lookup Column `Data schema` with Lookup Value `={{ true }}`.
- Credential Required: Connect your googleSheetsOAuth2Api credentials in Retrieve Data Schema.
- Open Schema Flag Update, set Operation to `appendOrUpdate`, and map Data schema to `={{true}}`.
- Credential Required: Connect your googleSheetsOAuth2Api credentials in Schema Flag Update.
- Open Append Sheet Row, keep Operation as `append`, and confirm column mappings use AI expressions like ``` ={{ /*n8n-auto-generated-fromAI-override*/ $fromAI('Summary_Content', ``, 'string') }} ```.
- Credential Required: Connect your googleSheetsOAuth2Api credentials in Append Sheet Row.
Step 3: Set Up URL Validation and Schema Checks
This section validates user URLs, checks whether the schema is already present, and routes into either the chat response or sitemap scanning flow.
- Open Schema Check and confirm the condition uses `={{ $json['Data schema'] }}` with a boolean true check.
- Open URL Validator Agent and set Text to `={{ $node["Web Chat Trigger"].json["chatInput"] }}`.
- Attach Structured Parser to URL Validator Agent as the output parser and keep the JSON schema example: `{ "URL": "example.com", "URL_bool": true }`
- Connect OpenAI Validator Model as the language model for URL Validator Agent.
- Credential Required: Connect your openAiApi credentials in OpenAI Validator Model (credentials are set on the model node, not the agent).
- Open URL Valid Check and confirm the condition uses `={{ $json.output.URL_bool }}` with a boolean true check.
- For invalid URLs, ensure Invalid URL Reply sends the message `Debes introducir una URL válida ejemplo: https://google.es` (“You must enter a valid URL, for example: https://google.es”).
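The validator itself is an AI agent, but its structured output is easy to approximate. This is a hedged, non-AI sketch of the `{ URL, URL_bool }` shape the Structured Parser enforces; `validateUrl` is an illustrative helper, not the workflow's actual code:

```javascript
// Illustrative (non-AI) approximation of the URL Validator Agent's output.
// The real workflow asks OpenAI; this uses the WHATWG URL parser instead.
function validateUrl(chatInput) {
  const trimmed = chatInput.trim();
  const candidate = trimmed.startsWith('http') ? trimmed : `https://${trimmed}`;
  try {
    const u = new URL(candidate);
    // Require a dot in the hostname so bare words like "hello" fail.
    const ok = u.hostname.includes('.');
    return { URL: ok ? u.hostname : chatInput, URL_bool: ok };
  } catch {
    return { URL: chatInput, URL_bool: false };
  }
}

validateUrl('example.com'); // → { URL: 'example.com', URL_bool: true }
validateUrl('hello');       // → { URL: 'hello', URL_bool: false }
```

Whatever produces it, the downstream URL Valid Check only cares that `URL_bool` is a boolean `true`.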
Note: If the validator’s output is missing `URL_bool`, URL Valid Check will not pass. Keep the structured parser schema intact.

Step 4: Configure the Conversational Agent and Memory
This agent answers chat requests using the website data in Google Sheets and a language model, with memory enabled for context.
- Open Conversational Agent and set Text to `={{ $node["Web Chat Trigger"].json["chatInput"] }}`.
- Connect OpenAI Chat Engine as the language model for Conversational Agent.
- Credential Required: Connect your openAiApi credentials in OpenAI Chat Engine.
- Attach Lightweight Memory to Conversational Agent and keep Context Window Length at `50`.
- Attach Fetch Sheet Rows as an AI tool to Conversational Agent so it can retrieve website content from the sheet.
- Credential Required: Connect your googleSheetsOAuth2Api credentials in Fetch Sheet Rows (tools use their own credentials, but are invoked by the agent).
- Confirm Conversational Agent routes to Agent Response and that Agent Response uses `={{ $json.output }}` for its message.
Step 5: Configure Sitemap Discovery and Selection
These nodes build a valid sitemap URL from robots.txt, fetch the sitemap, and select the correct sitemap section using AI.
- Open Scan Options and keep the boolean flags as configured: scan_pages `true`, scan_posts `false`, category `false`, tags `false`.
- In Random UA Picker, keep the provided JavaScript that returns a random `userAgent` string.
- Open Robots.txt Request and set URL to `={{ $node["URL Validator Agent"].json["output"]["URL"] }}/robots.txt`.
- Confirm Parse Sitemap URL parses the robots.txt content and returns `sitemapUrl` as coded.
- Open Sitemap Fetch and set URL to `={{ $json.sitemapUrl }}` with a User-Agent header of `={{ $json.userAgent }}`.
- Open XML Decode A to parse the sitemap XML and pass it to Sitemap Selector.
- Open Sitemap Selector, keep JSON Output enabled, and verify it references scan options and the sitemap index locs using expressions like `{{ $json.sitemapindex.sitemap[0].loc }}`.
- Credential Required: Connect your openAiApi credentials in Sitemap Selector.
- Open Page Sitemap Fetch and set URL to `={{ $json.message.content.sitemap_page }}`, then route to XML Decode B.
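For orientation, here are assumed sketches of what the two Code nodes in this step do. These are not the template's exact scripts; the user-agent strings and function names are illustrative:

```javascript
// "Parse Sitemap URL" equivalent: find the Sitemap: line in a robots.txt body.
function extractSitemapUrl(robotsTxt) {
  for (const line of robotsTxt.split('\n')) {
    const m = line.trim().match(/^sitemap:\s*(\S+)/i);
    if (m) return { sitemapUrl: m[1] };
  }
  return { sitemapUrl: null }; // downstream, this triggers Sitemap Failure Halt
}

// "Random UA Picker" equivalent: rotate user agents so repeated
// requests look less bot-like (example pool, not the template's).
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
];
function pickUserAgent() {
  return { userAgent: USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)] };
}

extractSitemapUrl('User-agent: *\nSitemap: https://example.com/sitemap.xml');
// → { sitemapUrl: 'https://example.com/sitemap.xml' }
```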
Note: If Parse Sitemap URL returns `null` for `sitemapUrl`, Sitemap Fetch will fail and trigger Sitemap Failure Halt. Ensure robots.txt contains a valid Sitemap: line.

Step 6: Configure URL Splitting, Page Requests, and Summarization
This stage expands the selected sitemap into individual URLs, fetches each page, converts HTML to Markdown, and summarizes the content with AI.
- Open Combine URL Map and keep the JavaScript that transforms `urlset.url` into a `urls` object.
- Open Split URL Items and set Field To Split Out to `urls`.
- Open Batch Iterator to control batch processing of URLs before requesting pages.
- Open Page Request and set URL to `={{ $('Split URL Items').item.json.urls }}` with a User-Agent header of `={{ $json.userAgent }}`.
- Open HTML Convert to MD and set HTML to `={{ $json.data }}`.
- Open Page Summary Model and ensure the message includes `=URL: {{ $('Split URL Items').item.json.urls }} {{ $json.data }}` in the prompt.
- Credential Required: Connect your openAiApi credentials in Page Summary Model.
- Confirm Page Summary Model uses Append Sheet Row as an AI tool to write results to the sheet.
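The Combine URL Map step can be sketched roughly like this. This is an assumed reconstruction, not the template's exact code, and the field names (`urlset.url`, `loc`, `urls`) follow what the guide above describes:

```javascript
// Hedged sketch of "Combine URL Map": flatten a parsed <urlset> into a
// urls array that "Split URL Items" can split out into one item per URL.
function combineUrlMap(parsedSitemap) {
  const entries = parsedSitemap.urlset.url;
  // The XML node may yield a single object for a one-URL sitemap, so normalize.
  const list = Array.isArray(entries) ? entries : [entries];
  return { urls: list.map(e => e.loc) };
}

combineUrlMap({
  urlset: { url: [{ loc: 'https://example.com/' }, { loc: 'https://example.com/about' }] },
});
// → { urls: ['https://example.com/', 'https://example.com/about'] }
```

Splitting on `urls` then gives the Batch Iterator one item per page, which is what keeps large crawls from overwhelming a single HTTP request.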
Step 7: Add Error Handling
The workflow includes explicit stops for invalid URLs and sitemap failures.
- Verify Request Error Halt is connected from the error output of Robots.txt Request and uses the message `URL mal introducida, debes introducir con el siguiente formato: ejemplo.com` (“URL entered incorrectly; use the following format: ejemplo.com”).
- Verify Sitemap Failure Halt is connected from the error output of Sitemap Fetch and uses the message `Sitemap no encontrado o acceso bloqueado` (“Sitemap not found or access blocked”).
Step 8: Test and Activate Your Workflow
Run a controlled test to validate the end-to-end chat response and sitemap summarization flow.
- Click Execute Workflow and send a chat message through Web Chat Trigger with a URL like `https://example.com`.
- Confirm a valid URL routes through URL Valid Check to Scan Options, Robots.txt Request, and Sitemap Fetch.
- Verify that Page Summary Model runs and Append Sheet Row adds a row with `Lang`, `Page URL`, and `Summary Content`.
- Check that Agent Response returns a chat reply when Schema Check is true and that Invalid URL Reply triggers for invalid URLs.
- Once validated, switch the workflow to Active for production use.
Watch Out For
- Google Sheets credentials can expire or need specific permissions. If things break, check the n8n credential connection and the sheet sharing settings first.
- If you’re using Wait nodes or external rendering, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
- Default prompts in AI nodes are generic. Add your brand voice early or you’ll be editing outputs forever.
Common Questions
**How long does setup take?**

About 30 minutes if your accounts and API keys are ready.

**Can a non-developer set this up?**

Yes, but someone should be comfortable connecting Google Sheets and adding an OpenAI API key. The rest is mostly choosing scan options and testing with one small site first.

**Is it free to run?**

Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in OpenAI API costs, which can climb during large crawls.

**Where should I host it?**

Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.

**How can I customize what gets crawled and extracted?**

You can change what gets crawled by adjusting the Scan Options setup and the sitemap selection behavior, so you only process pages, posts, or other URL types. Many teams also customize the AI extraction prompt used during page analysis to capture things like meta titles, CTAs, or schema snippets. If you want fresher answers later, keep the agent’s live HTTP tool enabled so it can re-check critical pages on demand.

**Why did my Google Sheets connection stop working?**

Usually it’s an expired token or the sheet permissions changed. Reconnect the Google Sheets credential in n8n, confirm the correct Google account is used, and make sure the target spreadsheet is accessible to that account. If it fails only on “append row,” check that your header columns still match what the workflow is trying to write.

**How many pages can it crawl?**

It depends on your n8n plan and your server. On self-hosted n8n there’s no execution cap, but crawl size is limited by runtime, rate limits, and OpenAI token costs, especially during the first indexing pass.

**Should I build this in n8n, Zapier, or Make?**

For crawling and analysis, n8n is usually the practical choice because you need batching, conditional logic, and a “store it in Sheets then query it later” pattern. Zapier and Make can do pieces of this, but long-running crawls and multi-step processing get awkward fast, and costs can jump when you’re iterating over many URLs. n8n also gives you self-hosting, which matters when you’re running lots of executions. Honestly, the deciding factor is complexity: simple two-app sync, Zapier is fine. For this workflow’s crawl + AI extraction + agent mode setup, n8n fits better. If you want a second opinion, Talk to an automation expert.
Once a site is indexed, the annoying part of auditing goes quiet. You keep the structured sheet, ask better questions, and move faster on the work that actually changes performance.
Need Help Setting This Up?
Our automation experts can build and customize this workflow for your specific needs. Free 15-minute consultation—no commitment required.