January 22, 2026

Google Sheets + OpenAI: crawl sites, audit faster

Lisa Granqvist, Workflow Automation Expert

You open a site audit and immediately drown in tabs. Home, services, blog categories, random landing pages, then another tab for headings, another for links, another for notes you swear you’ll organize later.

This is where site crawl automation helps. SEO analysts usually feel the pain first, but content leads and agency account managers get stuck in the same loop: find page, copy headings, count links, summarize, repeat.

This workflow crawls a site, stores the useful bits in Google Sheets, then lets you ask questions like a chatbot using that saved data. You’ll see what it does, what you need, and where teams typically save a few hours per audit.


The Challenge: Website audits turn into tab chaos

A “quick” website audit rarely stays quick. You start with one question (“what’s on the services pages?”), then you’re manually checking headings, scanning navigation paths, and pulling links into a doc that nobody ever reuses. The mental load is sneaky too. You’re constantly switching context, which makes you miss obvious issues like duplicate H1s, thin pages hiding in a sitemap, or outdated CTAs. And if you’re doing this for multiple client sites, it becomes the same repetitive grind every week.

It adds up fast. Here’s where it breaks down in real life.

  • You end up copying headings and links by hand, and the moment you paste them into a doc they go stale.
  • Sitemaps are inconsistent across sites, so you waste time hunting for the “right” URLs to review.
  • Audits get stuck at “observations” because summarizing every important page takes forever.
  • When someone asks a follow-up question a week later, you have to redo the crawl because your notes aren’t structured.

The Fix: Crawl once, store in Sheets, then ask questions

This workflow starts as a web consultation chatbot, but the real win is what it remembers. The first time you give it a website URL, it validates the URL with AI, finds the sitemap through robots.txt, and chooses the most relevant sitemap (pages, posts, categories, tags) based on your scan options. Then it crawls the selected URLs in batches, downloads each page’s HTML, converts it to Markdown, and uses OpenAI to extract the language, heading hierarchy, internal links, external links, and a clean summary. Each page becomes a structured row in Google Sheets, so you can filter, sort, and reuse it later. Once indexing is done, the sheet gets flagged as ready, which flips the workflow into “agent mode” for future questions.

After that first crawl, you don’t re-audit from scratch. You ask questions in chat, and the AI agent reads your Google Sheets “memory” to answer quickly. If something needs fresh info, it can also make real-time HTTP requests so you’re not stuck trusting old snapshots.
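Each crawled page ends up as one structured row in the sheet. A minimal sketch of what that record might look like; only Lang, Page URL, and Summary Content are column names confirmed by the workflow, and the other field names here are illustrative:

```javascript
// Sketch of one crawled-page record before it is appended to Google Sheets.
// "Headings", "Internal Links", and "External Links" are assumed column
// names for illustration; the workflow's sheet may differ.
function buildPageRecord(url, analysis) {
  return {
    "Page URL": url,
    "Lang": analysis.language,                  // e.g. "en"
    "Headings": analysis.headings.join(" | "),  // flattened heading hierarchy
    "Internal Links": analysis.internalLinks.length,
    "External Links": analysis.externalLinks.length,
    "Summary Content": analysis.summary,
  };
}

const record = buildPageRecord("https://example.com/services", {
  language: "en",
  headings: ["H1: Our Services", "H2: Consulting"],
  internalLinks: ["/contact", "/pricing"],
  externalLinks: ["https://partner.example.org"],
  summary: "Overview of consulting services.",
});
console.log(record["Lang"]); // "en"
```

Keeping one flat row per page is what makes the later "agent mode" questions cheap: the agent filters rows instead of re-crawling.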


Real-World Impact

Say you’re auditing a 50-page marketing site. Manually, you might spend about 5 minutes per page to open it, note the H1/H2s, skim for the gist, and grab a couple of key links, which is roughly 4 hours of pure busywork. With this workflow, you submit the URL once, let it crawl in the background, and you’re left with a Google Sheet containing headings, links, and summaries for every page. From there, answering “show me all H1s on services pages” takes minutes, not another afternoon.

Requirements

  • n8n instance (try n8n Cloud free)
  • Self-hosting option if you prefer (Hostinger works well)
  • Google Sheets to store crawl results and “memory”.
  • OpenAI to extract headings, links, and summaries.
  • OpenAI API key (get it from your OpenAI dashboard).

Skill level: Intermediate. You’ll connect accounts, paste API keys, and tweak scan options, but you won’t be writing real code unless you want to.

Need help implementing this? Talk to an automation expert (free 15-minute consultation).

The Workflow Flow

A chat message triggers the workflow. You enter a website URL in the chat interface, and the workflow first checks Google Sheets to see if that site has already been indexed.

It validates the URL and discovers the sitemap. If the site is new, an AI validator confirms the URL is usable, then an HTTP request pulls robots.txt to locate sitemap URLs. Another AI step selects the most relevant sitemap based on your scan options (for example, pages vs. posts).

The crawler processes pages in batches. Each URL is fetched with an HTTP request using a rotating user-agent, the HTML is converted to Markdown, and OpenAI analyzes the content to extract language, heading structure, internal links, external links, and a summary. Those results are appended into Google Sheets row-by-row.

Then it switches into “agent mode” for questions. Once the sheet is flagged as indexed, the LangChain-style agent reads your stored rows to answer questions like “what’s on the contact page?” It can also do live HTTP checks when you need up-to-date info.

You can easily modify which URL types get scanned (pages, posts, categories) to match your audit style. See the full implementation guide below for customization options.
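The batched crawling in the third step amounts to plain list chunking. A sketch, assuming a batch size of 5 (the actual size is whatever you configure on the Batch Iterator node):

```javascript
// Minimal sketch of batch iteration over sitemap URLs, mirroring what the
// Batch Iterator node does. The batch size of 5 is an assumption, not the
// workflow's configured value.
function toBatches(urls, batchSize = 5) {
  const batches = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    batches.push(urls.slice(i, i + batchSize));
  }
  return batches;
}

const urls = Array.from({ length: 12 }, (_, i) => `https://example.com/page-${i + 1}`);
const batches = toBatches(urls);
console.log(batches.length); // 3 batches: 5 + 5 + 2
```

Smaller batches keep each n8n execution short and make rate-limit failures cheaper to retry.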

Step-by-Step Implementation Guide

Step 1: Configure the Web Chat Trigger

Set up the inbound chat endpoint that starts the workflow and passes user input into the AI validation and conversation paths.

  1. Add and open Web Chat Trigger.
  2. Set Public to true and Authentication to basicAuth.
  3. Credential Required: Connect your httpBasicAuth credentials.
  4. Confirm that Web Chat Trigger outputs to Retrieve Data Schema as the first step in the flow.

Step 2: Connect Google Sheets

These nodes read and write the website content database used by the assistant and by the summarization process.

  1. Open Retrieve Data Schema and select the spreadsheet in Document ID and Sheet Name (e.g., gid=0).
  2. In Retrieve Data Schema, ensure the filter uses Lookup Column Data schema with Lookup Value ={{ true }}.
  3. Credential Required: Connect your googleSheetsOAuth2Api credentials in Retrieve Data Schema.
  4. Open Schema Flag Update, set Operation to appendOrUpdate, and map Data schema to ={{true}}.
  5. Credential Required: Connect your googleSheetsOAuth2Api credentials in Schema Flag Update.
  6. Open Append Sheet Row, keep Operation as append, and confirm column mappings use the AI expressions like ={{ /*n8n-auto-generated-fromAI-override*/ $fromAI('Summary_Content', ``, 'string') }}.
  7. Credential Required: Connect your googleSheetsOAuth2Api credentials in Append Sheet Row.

Tip: Fetch Sheet Rows and Append Sheet Row are AI tools. Their credentials are configured directly on those nodes, but they are used by Conversational Agent and Page Summary Model as tools.

Step 3: Set Up URL Validation and Schema Checks

This section validates user URLs, checks whether the schema is already present, and routes into either the chat response or sitemap scanning flow.

  1. Open Schema Check and confirm the condition uses ={{ $json['Data schema'] }} with a boolean true check.
  2. Open URL Validator Agent and set Text to ={{ $node["Web Chat Trigger"].json["chatInput"] }}.
  3. Attach Structured Parser to URL Validator Agent as the output parser and keep the JSON schema example:

{ "URL": "example.com", "URL_bool": true }

  4. Connect OpenAI Validator Model as the language model for URL Validator Agent.
  5. Credential Required: Connect your openAiApi credentials in OpenAI Validator Model (credentials are set on the model node, not the agent).
  6. Open URL Valid Check and confirm the condition uses ={{ $json.output.URL_bool }} with a boolean true check.
  7. For invalid URLs, ensure Invalid URL Reply sends the message Debes introducir una URL válida ejemplo: https://google.es ("You must enter a valid URL, e.g. https://google.es").

⚠️ Common Pitfall: If the URL validator returns a non-boolean in URL_bool, URL Valid Check will not pass. Keep the structured parser schema intact.
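That non-boolean slip can also be guarded defensively. A sketch of a normalizing check, using the parser output shape from the JSON example above (the helper name is hypothetical, not a node in the workflow):

```javascript
// Sketch of a defensive check on the structured parser output. URL Valid
// Check expects output.URL_bool to be a real boolean; this helper also
// tolerates common model slips like the string "true".
function isUrlValid(parserOutput) {
  const v = parserOutput && parserOutput.URL_bool;
  if (typeof v === "boolean") return v;
  if (typeof v === "string") return v.trim().toLowerCase() === "true";
  return false;
}

console.log(isUrlValid({ URL: "example.com", URL_bool: true }));   // true
console.log(isUrlValid({ URL: "example.com", URL_bool: "true" })); // true
console.log(isUrlValid({ URL: "", URL_bool: "nope" }));            // false
```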

Step 4: Configure the Conversational Agent and Memory

This agent answers chat requests using the website data in Google Sheets and a language model, with memory enabled for context.

  1. Open Conversational Agent and set Text to ={{ $node["Web Chat Trigger"].json["chatInput"] }}.
  2. Connect OpenAI Chat Engine as the language model for Conversational Agent.
  3. Credential Required: Connect your openAiApi credentials in OpenAI Chat Engine.
  4. Attach Lightweight Memory to Conversational Agent and keep Context Window Length at 50.
  5. Attach Fetch Sheet Rows as an AI tool to Conversational Agent so it can retrieve website content from the sheet.
  6. Credential Required: Connect your googleSheetsOAuth2Api credentials in Fetch Sheet Rows (tools use their own credentials, but are invoked by the agent).
  7. Confirm Conversational Agent routes to Agent Response and that Agent Response uses ={{ $json.output }} for its message.

Step 5: Configure Sitemap Discovery and Selection

These nodes build a valid sitemap URL from robots.txt, fetch the sitemap, and select the correct sitemap section using AI.

  1. Open Scan Options and keep the boolean flags as configured: scan_pages true, scan_posts false, category false, tags false.
  2. In Random UA Picker, keep the provided JavaScript that returns a random userAgent string.
  3. Open Robots.txt Request and set URL to ={{ $node["URL Validator Agent"].json["output"]["URL"] }}/robots.txt.
  4. Confirm Parse Sitemap URL parses the robots.txt content and returns sitemapUrl as coded.
  5. Open Sitemap Fetch and set URL to ={{ $json.sitemapUrl }} with headers including User-Agent ={{ $json.userAgent }}.
  6. Open XML Decode A to parse the sitemap XML and pass it to Sitemap Selector.
  7. Open Sitemap Selector, keep JSON Output enabled, and verify it references scan options and the sitemap index locs using expressions like {{ $json.sitemapindex.sitemap[0].loc }}.
  8. Credential Required: Connect your openAiApi credentials in Sitemap Selector.
  9. Open Page Sitemap Fetch and set URL to ={{ $json.message.content.sitemap_page }}, then route to XML Decode B.

⚠️ Common Pitfall: If Parse Sitemap URL returns null, Sitemap Fetch will fail and trigger Sitemap Failure Halt. Ensure robots.txt contains a valid Sitemap: line.
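The article doesn't include the node code, but the two code steps in this section can be sketched as plain JavaScript: pick a random user-agent, and pull the `Sitemap:` directive out of robots.txt. The user-agent strings below are examples, not the workflow's shipped list:

```javascript
// Sketch of Random UA Picker: return one user-agent at random from a small
// pool (these strings are examples; the workflow ships its own list).
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
];
function pickUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Sketch of Parse Sitemap URL: find the first "Sitemap:" directive in
// robots.txt and return its URL, or null if none exists -- the case that
// triggers Sitemap Failure Halt.
function parseSitemapUrl(robotsTxt) {
  const match = robotsTxt.match(/^sitemap:\s*(\S+)/im);
  return match ? match[1] : null;
}

const robots = "User-agent: *\nDisallow: /admin\nSitemap: https://example.com/sitemap.xml\n";
console.log(parseSitemapUrl(robots)); // "https://example.com/sitemap.xml"
```

The directive match is case-insensitive because robots.txt files are inconsistent about `Sitemap:` vs `sitemap:` capitalization.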

Step 6: Configure URL Splitting, Page Requests, and Summarization

This stage expands the selected sitemap into individual URLs, fetches each page, converts HTML to Markdown, and summarizes the content with AI.

  1. Open Combine URL Map and keep the JavaScript that transforms urlset.url into a urls object.
  2. Open Split URL Items and set Field To Split Out to urls.
  3. Open Batch Iterator to control batch processing of URLs before requesting pages.
  4. Open Page Request and set URL to ={{ $('Split URL Items').item.json.urls }} with User-Agent header ={{ $json.userAgent }}.
  5. Open HTML Convert to MD and set HTML to ={{ $json.data }}.
  6. Open Page Summary Model and ensure the message includes =URL: {{ $('Split URL Items').item.json.urls }} {{ $json.data }} in the prompt.
  7. Credential Required: Connect your openAiApi credentials in Page Summary Model.
  8. Confirm Page Summary Model uses Append Sheet Row as an AI tool to write results to the sheet.

Tip: If pages return blank content, check Page Request headers and the User-Agent from Random UA Picker.
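The Combine URL Map / Split URL Items pair amounts to flattening the decoded sitemap into a list of locations. A sketch, assuming the standard sitemap shape of `urlset.url` entries with `loc` fields after XML decoding:

```javascript
// Sketch of Combine URL Map: turn a decoded sitemap (urlset.url entries,
// each with a loc) into the flat urls list that Split URL Items then
// emits one item at a time.
function combineUrlMap(urlset) {
  // Each <url> entry carries a <loc>; keep just the locations.
  return { urls: urlset.url.map((entry) => entry.loc) };
}

const decodedSitemap = {
  url: [
    { loc: "https://example.com/", lastmod: "2026-01-01" },
    { loc: "https://example.com/services" },
    { loc: "https://example.com/contact" },
  ],
};
const { urls } = combineUrlMap(decodedSitemap);
console.log(urls.length); // 3
```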

Step 7: Add Error Handling

The workflow includes explicit stops for invalid URLs and sitemap failures.

  1. Verify Request Error Halt is connected from the error output of Robots.txt Request and uses the message URL mal introducida, debes introducir con el siguiente formato: ejemplo.com ("URL entered incorrectly; use the following format: ejemplo.com").
  2. Verify Sitemap Failure Halt is connected from the error output of Sitemap Fetch and uses the message Sitemap no encontrado o acceso bloqueado ("Sitemap not found or access blocked").

⚠️ Common Pitfall: If Sitemap Fetch returns a 403/404, confirm the sitemap URL from Parse Sitemap URL and whether the site blocks automated requests.

Step 8: Test and Activate Your Workflow

Run a controlled test to validate the end-to-end chat response and sitemap summarization flow.

  1. Click Execute Workflow and send a chat message through Web Chat Trigger with a URL like https://example.com.
  2. Confirm a valid URL routes through URL Valid Check to Scan Options, Robots.txt Request, and Sitemap Fetch.
  3. Verify that Page Summary Model runs and Append Sheet Row adds a row with Lang, Page URL, and Summary Content.
  4. Check that Agent Response returns a chat reply when Schema Check is true and that Invalid URL Reply triggers for invalid URLs.
  5. Once validated, switch the workflow to Active for production use.

Watch Out For

  • Google Sheets credentials can expire or need specific permissions. If things break, check the n8n credential connection and the sheet sharing settings first.
  • If you’re using Wait nodes or external rendering, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
  • Default prompts in AI nodes are generic. Add your brand voice early or you’ll be editing outputs forever.

Common Questions

How quickly can I implement this site crawl automation?

About 30 minutes if your accounts and API keys are ready.

Can non-technical teams implement this site crawl automation?

Yes, but someone should be comfortable connecting Google Sheets and adding an OpenAI API key. The rest is mostly choosing scan options and testing with one small site first.

Is n8n free to use for this site crawl automation workflow?

Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in OpenAI API costs, which can climb during large crawls.

Where can I host n8n to run this automation?

Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.

How do I adapt this site crawl automation solution to my specific challenges?

You can change what gets crawled by adjusting the Scan Options setup and the sitemap selection behavior, so you only process pages, posts, or other URL types. Many teams also customize the AI extraction prompt used during page analysis to capture things like meta titles, CTAs, or schema snippets. If you want fresher answers later, keep the agent’s live HTTP tool enabled so it can re-check critical pages on demand.

Why is my Google Sheets connection failing in this workflow?

Usually it’s an expired token or the sheet permissions changed. Reconnect the Google Sheets credential in n8n, confirm the correct Google account is used, and make sure the target spreadsheet is accessible to that account. If it fails only on “append row,” check that your header columns still match what the workflow is trying to write.

What’s the capacity of this site crawl automation solution?

It depends on your n8n plan and your server. On self-hosted n8n there’s no execution cap, but crawl size is limited by runtime, rate limits, and OpenAI token costs, especially during the first indexing pass.

Is this site crawl automation better than using Zapier or Make?

For crawling and analysis, n8n is usually the practical choice because you need batching, conditional logic, and a “store it in Sheets, then query it later” pattern. Zapier and Make can do pieces of this, but long-running crawls and multi-step processing get awkward fast, and costs can jump when you’re iterating over many URLs. n8n also gives you self-hosting, which matters when you’re running lots of executions. The deciding factor is complexity: for a simple two-app sync, Zapier is fine, but for this workflow’s crawl + AI extraction + agent mode setup, n8n fits better. If you want a second opinion, talk to an automation expert.

Once a site is indexed, the annoying part of auditing goes quiet. You keep the structured sheet, ask better questions, and move faster on the work that actually changes performance.

Need Help Setting This Up?

Our automation experts can build and customize this workflow for your specific needs. Free 15-minute consultation—no commitment required.

Lisa Granqvist

Workflow Automation Expert

Expert in workflow automation and no-code tools.
