Translations:FACTS About Building Retrieval Augmented Generation-based Chatbots/31/ja: Difference between revisions

Latest revision as of 07:13, 20 February 2025

Message definition (FACTS About Building Retrieval Augmented Generation-based Chatbots)

'''Handling multi-modal data''': Enterprise data is multi-modal. Handling structured, unstructured, and multi-modal data is crucial for a versatile RAG pipeline. In our experience, if the structure of a document is consistent and known a priori (as with the SEC filings in the EDGAR database, in the financial earnings domain that the Scout bot was handling), splitting at the section level and incorporating the section titles and subheadings into the context of each chunk improves retrieval relevance. We also found that solutions like Unstructured.io, which specialize in extracting and structuring content from PDFs, are helpful for parsing and chunking unstructured documents with context.

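Handling multi-modal data: Enterprise data is multi-modal. Being able to process structured, unstructured, and multi-modal data is important for a versatile RAG pipeline. In our experience, when a document's structure is consistent and known in advance (for example, the SEC filings found in the EDGAR database, in the financial earnings domain that the Scout bot handled), splitting at the section level and building the section titles and subheadings into each chunk's context improves retrieval relevance. We also found that solutions such as Unstructured.io, which specialize in extracting and structuring content from PDFs, help when parsing and chunking unstructured documents while preserving their context.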
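The section-level splitting described above can be illustrated with a minimal Python sketch. It is not the Scout bot's implementation. It assumes plain-text input in which top-level sections look like "Item 1A. Risk Factors" and subheadings are short lines ending in a colon. Both patterns, the `Chunk` class, and `split_by_sections` are hypothetical names introduced here for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed heading layouts (hypothetical; adapt these to the real documents).
SECTION_RE = re.compile(r"^Item\s+\d+[A-Z]?\.\s+.+$")
SUBHEADING_RE = re.compile(r"^[A-Z][A-Za-z ]+:$")


@dataclass
class Chunk:
    section: str
    subheading: str
    body: str

    def text_for_embedding(self) -> str:
        """Prefix the body with its heading path so the retriever sees the context."""
        path = " > ".join(p for p in (self.section, self.subheading) if p)
        return f"[{path}]\n{self.body}" if path else self.body


def split_by_sections(text: str) -> list[Chunk]:
    """Split text at section and subheading lines, one chunk per heading path."""
    chunks: list[Chunk] = []
    section, subheading = "", ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the text collected under the current heading path, if any.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(section, subheading, body))
        buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if SECTION_RE.match(stripped):
            flush()
            section, subheading = stripped, ""
        elif SUBHEADING_RE.match(stripped):
            flush()
            subheading = stripped.rstrip(":")
        else:
            buffer.append(stripped)
    flush()
    return chunks


# Made-up sample text, not taken from a real filing.
SAMPLE = """Item 1. Business
Overview:
We sell widgets worldwide.
Competition:
The widget market is crowded.
Item 1A. Risk Factors
Supply chain disruptions could hurt revenue.
"""

for chunk in split_by_sections(SAMPLE):
    print(chunk.text_for_embedding())
    print("---")
```

Each chunk carries its section title and subheading in the text that gets embedded, so a query mentioning "risk factors" can match a chunk whose body never uses those words. Real filings would likely also need a maximum chunk size and a fallback for sections without recognizable headings.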