Web Scrape QnA

Let's say you have a website (a store, an ecommerce site, a blog) and you want to scrape all of its relative links so that an LLM can answer any question about its content. This tutorial walks through how to achieve that.

Upsert Flow

This flow upserts all the information from a website into a vector database, then has an LLM answer the user's questions by looking up the relevant content in that vector database.

You can find the example flow, called Conversational Retrieval QA Chain, in the marketplace templates.

Here, we use the Cheerio Web Scraper node to scrape links from a given URL. We also replace RecursiveTextSplitter with HtmlToMarkdown for cleaner data preparation.

If you do not specify anything, only the page at the given URL is scraped by default. If you want to crawl the rest of the relative links, click Additional Parameters.

  • Get Relative Links Method - how the relative links are discovered: Web Crawl or Sitemap

  • Get Relative Links Limit - how many links to crawl, set 0 to crawl all
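To make the two parameters more concrete, here is a minimal sketch of what collecting relative links with a limit could look like. It is not the Cheerio Web Scraper's actual implementation: the names `relative_links` and `_HrefCollector` are hypothetical, and the sketch assumes that "relative links" means links on the same host as the given URL, and that a limit of 0 means no limit.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def relative_links(html, base_url, limit=0):
    """Return same-host links found in html, resolved against base_url.

    limit=0 means "no limit", mirroring the Get Relative Links Limit setting.
    """
    collector = _HrefCollector()
    collector.feed(html)

    base_host = urlparse(base_url).netloc
    found = []
    for href in collector.hrefs:
        # Resolve against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc != base_host:
            continue  # skip links to other sites
        if absolute not in found:
            found.append(absolute)
        if limit and len(found) >= limit:
            break
    return found
```

For example, calling `relative_links(page_html, "https://example.com/", limit=10)` would return at most ten same-host URLs found on that page. A real crawler would then fetch each of those pages and repeat the process.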

When you open the chat and ask your first question, all the links are scraped and upserted into the vector database (Pinecone in this case).

From the console/terminal, you can see all the links that are being scraped.

Navigate to the Pinecone dashboard, and you will see new vectors being added under the namespace you specified in the flow.

A few things to keep in mind

  • Once the flow is set, documents are upserted only once, when a user asks the first question from the UI, the API, or the Embedded Chat.

  • Documents are not upserted again when a user asks another question.

  • The only condition under which documents are upserted again is when the flow configuration (such as a different file, different models, or a different Pinecone index) has changed and you save the chatflow again.

  • In other words, the flow state is stored. When the chatflow is saved again, the existing state is compared with the new state: if they differ, the documents are upserted; if they are the same, the upsert is skipped.

  • However, sometimes you might want to change some settings, such as metadata, without triggering another upsert.

  • Therefore, it is generally recommended to create another flow that loads the existing index from the vector store.

  • How to solve TypeError: Cannot read properties of undefined

    You've made this chatbot for a website and everything was working fine, but after restarting, this issue appeared.

Understanding the exact cause of this issue can be quite intricate, so let's focus on the solution for now. As you follow these steps, the underlying reasons should become clearer.

  1. Restarting and the Issue: You've observed that restarting seems to lead to the problem. While this action appears problematic, it's not necessarily a fault in the true sense.

  2. Checking Pinecone:

    • Begin by heading to the Pinecone dashboard.

    • If you started with an empty Pinecone index, you should still see vectors from before the restart.

  3. Adjusting in the Flow:

    • If the above vectors are visible, return to the flow.

    • Here, remove the splitter and scraper nodes.

  4. Modifying Pinecone Nodes:

    • Replace the "Pinecone Upsert Document" node with the "Pinecone Load Existing Index" node.

    • Ensure you update the parameters of this new node, especially if you've assigned specific names to your index or have added metadata during your scrape.

  5. Finalizing the Setup:

    • Conclude by saving your changes and refreshing the system. After this, everything should be back on track!
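The upsert-once rule described under "things to keep in mind" (compare the stored flow state with the new one, and upsert only when they differ) can be sketched as follows. This is an illustration only, not Flowise's actual code: `config_fingerprint` and `UpsertGate` are hypothetical names, and the sketch assumes the flow configuration can be represented as a JSON-serializable dictionary.

```python
import hashlib
import json


def config_fingerprint(flow_config):
    """Return a stable hash of a flow configuration dict."""
    # sort_keys makes the hash independent of key order.
    canonical = json.dumps(flow_config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class UpsertGate:
    """Decides whether documents need to be upserted again."""

    def __init__(self):
        self._last_fingerprint = None  # no state stored yet

    def should_upsert(self, flow_config):
        fingerprint = config_fingerprint(flow_config)
        if fingerprint == self._last_fingerprint:
            return False  # existing state == new state: skip the upsert
        self._last_fingerprint = fingerprint
        return True  # first run or changed configuration: upsert
```

With this sketch, the first call to `should_upsert` returns `True`, a second call with an identical configuration returns `False`, and a call with a changed value (for example a different index name) returns `True` again.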

Load Existing Index Flow

This flow loads an existing index/collection from the vector store, typically after you have upserted the documents into that particular index/collection.

If you specified a namespace or metadata in the upsert flow, remember to specify the same values here as well, under Additional Parameters in the Pinecone node.

It is recommended to specify a system message for the Conversational Retrieval QA Chain. For example, you can specify the name of the AI, the language to answer in, and the response to give when the answer is not found (to prevent hallucination).

I want you to act as a document that I am having a conversation with. Your name is "AI Assistant". You will provide me with answers from the given info. If the answer is not included, say exactly "Hmm, I am not sure." and stop after that. Refuse to answer any question not about the info. Only answer in English. Never break character.

That's it! You can start asking questions 🤔

You can also turn on the Return Source Documents option to return a list of document chunks where the AI's response is coming from.
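If you consume the answer programmatically, the returned chunks can be summarized with a few lines of code. The sketch below is hedged: it assumes the response is a dictionary with a `text` field and a `sourceDocuments` list whose items carry `pageContent` and `metadata` (with a `source` URL in the metadata). Check these field names against an actual response from your own flow. `summarize_sources` is a hypothetical helper name.

```python
def summarize_sources(response, preview_chars=80):
    """Return the source and a short preview for each returned chunk.

    Assumes (not guaranteed) a response shaped like:
    {"text": "...", "sourceDocuments": [{"pageContent": "...", "metadata": {"source": "..."}}]}
    """
    summaries = []
    for doc in response.get("sourceDocuments", []):
        metadata = doc.get("metadata", {})
        content = doc.get("pageContent", "")
        summaries.append(
            {
                "source": metadata.get("source", "unknown"),
                "preview": content[:preview_chars],
            }
        )
    return summaries
```

If the option is turned off or the field is missing, the function simply returns an empty list, so it is safe to call on any response.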

The same logic can be applied to any document use cases, not just limited to web scraping.

If you have any suggestions on how to improve the performance, we'd love your contribution!
