By Christian OFOEFULE
In this new age of ChatGPT, large language models have taken centre stage, showcasing remarkable capabilities with minimal effort from the user.
In this article, we are going to build a simple application that enables us to chat with our documents, which can be in PDF or plain-text format.
We will use libraries like Langchain and learn the concept of vector embeddings: what they are and why they matter, both for saving costs and for working with large files.
The Foundation: Understanding Embeddings and Vector Databases
Unleashing the Power of Embeddings
Vector embeddings are a way of representing textual data in the form of numerical points. With embeddings, we can efficiently capture the semantic meaning of words, phrases, or entire documents. This approach allows us to overcome the limitations of direct text processing and facilitates advanced operations like similarity searches.
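To make this concrete, here is a minimal sketch (assuming an OpenAI API key is set in your environment) that embeds two paraphrased sentences and measures their similarity; semantically close texts produce vectors whose cosine similarity approaches 1:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
v1 = embeddings.embed_query("The cat sat on the mat")
v2 = embeddings.embed_query("A feline rested on the rug")

# cosine similarity: close to 1 for semantically similar texts
dot = sum(a * b for a, b in zip(v1, v2))
norms = sum(a * a for a in v1) ** 0.5 * sum(b * b for b in v2) ** 0.5
print(dot / norms)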
The Role of Vector Databases
Vector databases store vector embeddings. Notable players include Qdrant, Pinecone, and Weaviate. These databases provide a structured environment for storing and querying embeddings.
In essence, a cluster in Qdrant can house multiple collections, each functioning as a database. Within these collections, vectors (numerical representations) of data or text are stored as points, enabling seamless similarity searches.
Cost-Efficient AI: Storing and Querying Embeddings
With great power comes a great cost: when using powerful large language models like ChatGPT, the cost of generating embeddings for every search query can become a significant concern, especially when utilising the OpenAI API. The solution lies in the strategic use of vector databases: embed your documents once, store the embeddings, and at query time only the (much shorter) question needs to be embedded and matched against what is already stored.
Qdrant: Building Clusters and Collections
Qdrant is a powerful vector similarity search engine with a user-friendly API that enables you to effortlessly store, search, and manage vectors along with additional payloads.
In Qdrant, the process begins by creating a cluster, a high-level organizational unit. Within a cluster, multiple collections can coexist, each serving as a repository for vectors and enabling targeted searches within a specified data set. When you embed your text and send it to the Qdrant database, the resulting embeddings are stored as points within the designated collection.
Pinecone and Weaviate: Alternative Vector Database Solutions
Pinecone and Weaviate offer alternative avenues for storing and querying embeddings. With Pinecone’s robust infrastructure or Weaviate’s feature-rich capabilities, developers can explore diverse options based on their specific needs and preferences.
Langchain
Langchain is a framework for developing applications powered by language models. It provides the libraries and APIs that establish communication between the models and the vector database.
You can install it with pip using the following command:
pip install langchain
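The code below also leans on a few companion packages (the Qdrant client, the OpenAI SDK, and the unstructured library that powers the PDF loader), so depending on your environment you will likely need something like:
pip install qdrant-client openai unstructured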
Code Sample
We will now walk through a simple code sample showing how to chat with our documents. The steps are divided into two sections:
- Storing/Uploading our document as embeddings
- Chatting with the document embeddings using ChatGPT
Step 1: Import Libraries
import os

import qdrant_client
from qdrant_client.http import models

from langchain.vectorstores import Qdrant
from langchain.document_loaders import TextLoader, UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
Step 2: Read the document and split into chunks
loader = UnstructuredPDFLoader("/path-to-pdf-document")
documents = loader.load()
# split the file into small chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)
chunks = text_splitter.split_documents(documents)
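The same pipeline works for plain-text files; that is what the TextLoader imported earlier is for. If your document is a .txt file, simply swap the loader:
# plain-text alternative to the PDF loader
loader = TextLoader("/path-to-text-document")
documents = loader.load()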
Step 3: Create Embedding and Language Model Instances
embeddings = OpenAIEmbeddings()
llm = OpenAI()
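Both instances read your OpenAI credentials from the OPENAI_API_KEY environment variable, so make sure it is set (exported in your shell, or assigned in code) before running the script:
# set your key before creating the embeddings/LLM instances
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"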
Step 4: Create Qdrant Client
client = qdrant_client.QdrantClient(
    os.environ["QDRANT_HOST"],
    api_key=os.environ["QDRANT_API_KEY"],
)
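QDRANT_HOST and QDRANT_API_KEY here come from your Qdrant cluster dashboard. If you just want to experiment without a remote cluster, the client also supports an in-memory mode:
# ephemeral in-process instance, useful for local testing (nothing is persisted)
client = qdrant_client.QdrantClient(":memory:")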
Step 5: Create Vector Store
vector_store = Qdrant(
    client=client,
    collection_name=os.environ["QDRANT_COLLECTION_NAME"],
    embeddings=embeddings,
)
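Note that the collection referenced here must already exist in your cluster. If it does not, you can create it first using the models module imported in Step 1; the vector size must match your embedding model (OpenAI's default embeddings are 1536-dimensional):
# create the collection once, before uploading any embeddings
client.create_collection(
    collection_name=os.environ["QDRANT_COLLECTION_NAME"],
    vectors_config=models.VectorParams(
        size=1536,  # dimensionality of OpenAI embeddings
        distance=models.Distance.COSINE,
    ),
)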
Step 6: Upload the document chunks as embeddings to Qdrant vector store
vector_store.add_documents(chunks)
Now that we have successfully uploaded our documents as embeddings to a vector store, we can chat with them. For this, we will use Langchain's RetrievalQA API.
Step 7: Create a RetrievalQA chain type
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
)
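The "stuff" chain type simply stuffs all retrieved chunks into a single prompt, which works well as long as the retrieved text fits in the model's context window. If you want to control how many chunks are retrieved per question, the retriever accepts search parameters, for example:
# retrieve only the 4 most similar chunks for each question
retriever = vector_store.as_retriever(search_kwargs={"k": 4})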
Step 8: Run the Query and Return the Answer
question = "enter your question about the document"
answer = qa.run(question)
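The answer comes back as a plain string, so you can simply print it:
print(answer)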
As we chat with our documents using AI, this fusion of technologies empowers developers and businesses to navigate the evolving landscape of conversational AI. By understanding the nuances of embeddings, harnessing the capabilities of vector databases, and connecting them to language models through frameworks like Langchain, we open up avenues for AI-powered conversations that are efficient, cost-effective, and seamlessly integrated with the wealth of information stored in our documents.
Christian Ofoefule is a Software Engineer.