Translations:FACTS About Building Retrieval Augmented Generation-based Chatbots/31/ko: Difference between revisions

Latest revision as of 07:19, 20 February 2025

Message definition (FACTS About Building Retrieval Augmented Generation-based Chatbots)

'''Handling multi-modal data''': Enterprise data is multi-modal. Handling structured, unstructured, and multi-modal data is crucial for a versatile RAG pipeline. In our experience, if the structure of a document is consistent and known a priori (as with the SEC filings in the EDGAR database, which the Scout bot handled for the financial earnings domain), implementing section-level splitting and incorporating the section titles and subheadings into the context of each chunk improves retrieval relevance. We also found solutions like Unstructured.io, which specialize in extracting and structuring content from PDFs, helpful for parsing and chunking unstructured documents with context.

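Handling multi-modal data: Enterprise data is multi-modal. Handling structured, unstructured, and multi-modal data is essential for a versatile RAG pipeline. In our experience, when a document's structure is consistent and known in advance (as with the SEC filing data in the EDGAR database for the financial earnings domain, which the Scout bot handled), implementing section-level splitting using the section titles and subheadings, and incorporating them into the context of the chunks, improves retrieval relevance. We also found that solutions like Unstructured.io, which specialize in extracting and structuring content from PDFs, are useful for parsing and chunking unstructured documents with context.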
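
The section-level splitting described in this message can be illustrated with a minimal Python sketch. It is not the pipeline used for the Scout bot. It assumes, purely for illustration, that headings are written in Markdown style (<code>#</code>, <code>##</code>), and the names <code>Chunk</code>, <code>split_by_sections</code>, and <code>text_for_embedding</code> are hypothetical. Real filings would need a heading pattern matched to their known layout.

```python
import re
from dataclasses import dataclass

# Assumption: headings look like Markdown ("# Title", "## Subtitle").
# A real corpus would need a pattern that fits its own, known structure.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading_path: list[str]  # section title followed by any subheadings
    body: str

    def text_for_embedding(self) -> str:
        # Prepend the heading path so the chunk carries its own context.
        return " > ".join(self.heading_path) + "\n" + self.body


def split_by_sections(document: str) -> list[Chunk]:
    """Split a document at headings, one chunk per section body."""
    chunks: list[Chunk] = []
    path: list[str] = []    # current stack of headings
    buffer: list[str] = []  # body lines of the section being read

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(list(path), body))
        buffer.clear()

    for line in document.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then push this one.
            del path[level - 1:]
            path.append(match.group(2))
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each chunk's <code>text_for_embedding()</code> output, rather than the bare body, would be what is passed to the embedding step, so that a query mentioning a section title can match chunks from that section.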