Do you build GenAI systems and want to deploy them, or do you simply want to learn more about FastAPI? Then this is exactly what you were looking for! Imagine you have a pile of PDF reports and need to find specific answers in them. You can either spend hours scrolling, or build a system that reads them for you and answers your questions. In this article, we build a RAG system that is deployed and accessed through an API using FastAPI. So without any further ado, let's dive in.
What is FastAPI?
FastAPI is a Python framework for building APIs. It lets us use HTTP methods to communicate with the server.
One of its most useful features is that it auto-generates documentation for the APIs you create. After writing your code and creating the endpoints, you can visit a URL and use the interface (Swagger UI) to test your endpoints without writing any frontend code.
Understanding REST APIs
A REST (Representational State Transfer) API is an interface that enables communication between a client and a server. The client sends HTTP requests to a specific API endpoint, and the server processes those requests. There are quite a few HTTP methods; we will implement several of them in our project using FastAPI.
HTTP Methods:

In our project, we will use two methods to communicate:
- GET: Used to retrieve information. We will use a GET request to /health to check whether the server is running.
- POST: Used to send data to the server to create or process something. We will use POST requests for /ingest and /query, because both involve sending complex data like files or JSON objects. More on this in the implementation section.
What is RAG?
Retrieval-Augmented Generation (RAG) is one approach to giving an LLM access to specific data it wasn't originally trained on.
RAG components:
- Retrieval: Finding relevant sentences from the document(s) based on the query.
- Generation: Passing those sentences to an LLM so it can synthesize them into an answer.
Let's explore RAG further in the upcoming implementation section.
Implementation
Problem Statement: Build a system that allows users to upload documents, specifically .txt files or PDFs, indexes them into a searchable database, and lets an LLM answer questions about the new data. This system will be deployed and used through API endpoints that we create with FastAPI.
Pre-Requisites
- We will need an OpenAI API key, and we will use the gpt-4.1-mini model as the brain of the system. You can get your API key here: https://platform.openai.com/settings/organization/api-keys
- An IDE for running the Python scripts; I'll be using VS Code for the demo. Create a new project (folder).
- Create a .env file in your project and add your OpenAI key exactly like this:
OPENAI_API_KEY=sk-proj...
- Create a virtual environment for this project (to isolate the project's dependencies).

Note:
- Make sure fast_env is created inside your project, as path errors may occur if the working directory is not set to the project directory.
- Once activated, any packages you install will be contained within this environment.
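On macOS/Linux, the environment setup looks like this (fast_env is just the folder name the article uses; on Windows, activate with fast_env\Scripts\activate instead):

```shell
# create the virtual environment inside the project folder
python -m venv fast_env
# activate it (macOS/Linux)
. fast_env/bin/activate
# confirm the interpreter now comes from the environment
python -c "import sys; print(sys.prefix)"
```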
- Download the blog below as a PDF using the download icon, to use in our RAG system:
Requirements
To build this, we need a stack that handles the heavy lifting efficiently:
- FastAPI: To handle web requests and file uploads.
- LangChain: To extend the capabilities of the LLM.
- FAISS (Facebook AI Similarity Search): Enables searching through text chunks. We will use it as our vector database.
- Uvicorn: To host the server.
Create a requirements.txt in your project and run "pip install -r requirements.txt":
fastapi==0.129.0
uvicorn[standard]==0.41.0
python-multipart==0.0.22
langchain==1.2.10
langchain-community==0.4.1
langchain-openai==1.1.10
langchain-core==1.2.13
faiss-cpu==1.13.2
openai==2.21.0
pypdf==6.7.1
python-dotenv==1.2.1
Implementation Approach
We will implement two FastAPI endpoints:
1. The Ingestion Pipeline (/ingest)
When a user uploads a file, we use the RecursiveCharacterTextSplitter from LangChain. It breaks long documents into smaller chunks (we will configure each chunk to be 500 characters).
These chunks are then converted into embeddings and stored in our FAISS index (vector database). We persist the FAISS index to local storage so that even if the server restarts, the uploaded documents aren't lost.
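To make the chunking idea concrete, here is a naive pure-Python sketch of fixed-size chunking with overlap. LangChain's RecursiveCharacterTextSplitter is smarter (it prefers to split on paragraph and sentence boundaries), but the underlying idea is the same:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `overlap` characters, so sentences cut at a boundary are not lost."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

chunks = chunk_text("a" * 1200)
print([len(c) for c in chunks])  # -> [500, 500, 300]
```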
2. The Query Pipeline (/query)
When you ask a question, it is turned into a vector. We then use FAISS to retrieve the top k (usually 4) chunks of text most similar to the question.
Finally, we use LCEL (LangChain Expression Language) to implement the Generation component of the RAG. We send the question and those 4 chunks to gpt-4.1-mini along with our prompt to get the answer.
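The retrieval step boils down to nearest-neighbour search over embedding vectors. Here is a pure-Python sketch of cosine-similarity top-k selection (FAISS does the same job, just with highly optimized index structures):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_indices(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k_indices([1.0, 0.0], vecs, k=2))  # -> [0, 1]
```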
Python Code
In the same project folder, create two scripts, rag_pipeline.py and main.py:
rag_pipeline.py:
Imports
import os
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.documents import Document
from dotenv import load_dotenv
from typing import List
Configuration
# Loading OpenAI API key
load_dotenv()
# Config
FAISS_INDEX_PATH = "faiss_index"
EMBEDDING_MODEL = "text-embedding-3-small"
LLM_MODEL = "gpt-4.1-mini"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
Note: Ensure you have added the API key to the .env file.
Initializations and Defining the Functions
# Shared state
_vectorstore: FAISS | None = None
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)

def _load_vectorstore() -> FAISS | None:
    """Load an existing FAISS index from disk if it exists."""
    global _vectorstore
    if _vectorstore is None and os.path.exists(FAISS_INDEX_PATH):
        _vectorstore = FAISS.load_local(
            FAISS_INDEX_PATH,
            embeddings,
            allow_dangerous_deserialization=True
        )
    return _vectorstore
def ingest_document(file_path: str, filename: str = "") -> int:
    """
    Chunks, embeds, stores in FAISS, and returns the number of chunks stored.
    """
    global _vectorstore
    # 1. Load
    if file_path.endswith(".pdf"):
        loader = PyPDFLoader(file_path)
    else:
        loader = TextLoader(file_path)
    documents = loader.load()
    # Overwrite the source metadata with the filename
    display_name = filename or os.path.basename(file_path)
    for doc in documents:
        doc.metadata["source"] = display_name
    # 2. Chunk
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    chunks = splitter.split_documents(documents)
    # 3. Embed and store
    if _vectorstore is None:
        _load_vectorstore()
    if _vectorstore is None:
        _vectorstore = FAISS.from_documents(chunks, embeddings)
    else:
        _vectorstore.add_documents(chunks)
    # 4. Persist to disk
    _vectorstore.save_local(FAISS_INDEX_PATH)
    return len(chunks)

def _format_docs(docs: List[Document]) -> str:
    """Concatenate document page_content to add to the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)
These functions handle chunking the documents, converting the chunks into embeddings (using the embedding model text-embedding-3-small), and storing them in the FAISS index (vector store).
Defining the Retriever and Generator
def query_rag(question: str, top_k: int = 4) -> dict:
    """
    Returns the answer text and source references.
    """
    vs = _load_vectorstore()
    if vs is None:
        return {
            "answer": "No documents have been ingested yet. Please upload a document first.",
            "sources": []
        }
    # Retriever
    retriever = vs.as_retriever(
        search_type="similarity",
        search_kwargs={"k": top_k}
    )
    # Prompt
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template="""You are a helpful assistant. Use only the context below to answer the question.
If the answer is not in the context, say "I don't know based on the provided documents."

Context:
{context}

Question: {question}
Answer:"""
    )
    llm = ChatOpenAI(model=LLM_MODEL, temperature=0)
    # LCEL chain
    # Step 1: fetch the context and keep the raw documents for source reporting
    retrieve = RunnableParallel(
        {
            "context": retriever | _format_docs,
            "question": RunnablePassthrough(),
            "source_documents": retriever,
        }
    )
    # Step 2: generate the answer
    answer_chain = prompt | llm | StrOutputParser()
    # Invoke
    retrieved = retrieve.invoke(question)
    answer = answer_chain.invoke(
        {"context": retrieved["context"], "question": retrieved["question"]}
    )
    # Extract sources
    sources = list({
        doc.metadata.get("source", "unknown")
        for doc in retrieved["source_documents"]
    })
    return {
        "answer": answer,
        "sources": sources,
    }
We have now implemented our RAG pipeline, which retrieves 4 documents using similarity search and passes the question, context, and prompt to the generator (gpt-4.1-mini).
First, the relevant documents are fetched using the query; then the answer_chain is invoked, which returns the answer as a string via StrOutputParser().
Note: top_k and the question are passed as arguments to the function.
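For intuition, the prompt | llm | StrOutputParser() pipe is just function composition: each stage's output becomes the next stage's input. A toy sketch with fake stages (not LangChain's actual classes) shows the mechanics:

```python
class Stage:
    """Toy stand-in for an LCEL runnable: `a | b` composes two stages."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Piping builds a new stage that runs self first, then other.
        return Stage(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

# Fake prompt/llm/parser stages mirroring the shapes of the real ones.
fake_prompt = Stage(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Stage(lambda text: {"content": f"(answer based on: {text[:20]}...)"})
fake_parser = Stage(lambda msg: msg["content"])

chain = fake_prompt | fake_llm | fake_parser
result = chain.invoke({"context": "FAISS stores vectors.", "question": "What does FAISS store?"})
print(result)
```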
main.py
Imports
import os
import tempfile
from fastapi import FastAPI, UploadFile, File, HTTPException
from pydantic import BaseModel
from rag_pipeline import ingest_document, query_rag
We have imported the ingest_document and query_rag functions, which will be used by the API endpoints we define.
Configuration
app = FastAPI(
    title="RAG API",
    description="Upload documents and query them using RAG",
    version="1.0.0"
)

ALLOWED_EXTENSIONS = {
    "application/pdf": ".pdf",
    "text/plain": ".txt",
}

class QueryRequest(BaseModel):
    question: str
    top_k: int = 4

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
We use Pydantic to strictly define the structure of the inputs to the API.
Note: Validators can be added here as well to perform certain checks (for example, to verify that a phone number is exactly 10 digits).
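For example, assuming Pydantic v2, a field_validator could reject blank questions or out-of-range top_k values before the request ever reaches our RAG code (the model name and bounds here are illustrative, not part of the project above):

```python
from pydantic import BaseModel, field_validator

class ValidatedQueryRequest(BaseModel):
    question: str
    top_k: int = 4

    @field_validator("question")
    @classmethod
    def question_not_blank(cls, v: str) -> str:
        # Reject whitespace-only questions at the validation layer.
        if not v.strip():
            raise ValueError("question must not be blank")
        return v

    @field_validator("top_k")
    @classmethod
    def top_k_in_range(cls, v: int) -> int:
        # Illustrative bounds to keep retrieval sane.
        if not 1 <= v <= 20:
            raise ValueError("top_k must be between 1 and 20")
        return v
```

With a model like this, FastAPI would return a 422 for a blank question automatically, instead of relying on our manual 400 check in the endpoint.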
/health API
@app.get("/health", tags=["Health"])
def health():
    """Check if the API is running."""
    return {"status": "ok"}
This endpoint is useful for confirming that the server is running.
Note: We wrap the API functions with a decorator; here we use @app because we initialized FastAPI with that variable earlier. The decorator also specifies the HTTP method, here get(), and takes the path for the endpoint, which is "/health".
/ingest API (to take the document from the user)
@app.post("/ingest", tags=["Ingestion"], summary="Upload and index a document")
async def ingest(file: UploadFile = File(...)):
    """
    Upload a **.txt** or **.pdf** file.
    """
    if file.content_type not in ALLOWED_EXTENSIONS:
        raise HTTPException(
            status_code=400,
            detail=f"Unsupported file type '{file.content_type}'. Only .txt and .pdf are supported."
        )
    suffix = ALLOWED_EXTENSIONS[file.content_type]
    contents = await file.read()
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(contents)
        tmp_path = tmp.name
    try:
        num_chunks = ingest_document(tmp_path, filename=file.filename)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    finally:
        os.unlink(tmp_path)
    return {
        "message": f"Successfully ingested '{file.filename}'",
        "chunks_indexed": num_chunks
    }
This function ensures that only .txt or .pdf files are accepted and then calls the ingest_document() function defined in the rag_pipeline.py script.
/query API (to run the RAG pipeline)
@app.post("/query", response_model=QueryResponse, tags=["Query"], summary="Ask a question about your documents")
def query(request: QueryRequest):
    """
    Ask a question related to the uploaded documents.
    The pipeline returns the answer and the source file names used to generate it.
    """
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty.")
    try:
        result = query_rag(request.question, request.top_k)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    return QueryResponse(answer=result["answer"], sources=result["sources"])
Finally, we defined the endpoint that calls the query_rag() function and returns an answer grounded in the documents to the user. Let's quickly test it.
Running the App
- Run the command below in your command prompt or terminal:
uvicorn main:app --reload
Note: Make sure your environment is activated and all the dependencies are installed; otherwise you may see errors related to them.
- The app should now be up and running here:
- Open the Swagger UI using the URL below:
/docs

Great! We can test our APIs through this interface simply by passing arguments to the endpoints.
Testing Both APIs
1. /ingest API:

Click on "Try it out", choose demo.pdf (you can substitute any other PDF), and click Execute.

Great! The API processed our request and created the vector store from the PDF. You can verify this by looking at your project folder, where you will now see a faiss_index folder.
2. /query API:

Now click on "Try it out" and pass the arguments below (feel free to use different prompts and PDFs).
{
  "question": "Name 3 applications of Machine Learning",
  "top_k": 4
}

As expected, the response is closely related to the content of the PDF. Go ahead and play with the top_k parameter and try different questions as well.

Understanding HTTP Status Codes
HTTP status codes tell the client whether a request was successful or whether something went wrong.
Status Code Categories:
Success (2xx)
The request was successfully received and processed.

In our project:
- /health returns 200 OK when the server is running.
- /ingest and /query return 200 OK when successful.
Client Errors (4xx)
The error is caused by something the client sent.

In our project:
- If you upload an unexpected file type (not a PDF or .txt file), the API returns status code 400.
- If the question in /query is empty, the API returns status code 400.
- FastAPI returns status code 422 if the request body does not match the Pydantic model we defined.
Server Errors (5xx)
These indicate that something went wrong on the server side.

In our project:
- If ingestion or querying fails due to a FAISS or OpenAI error, the API returns status code 500.
Conclusion
We successfully learned to build and deploy a RAG system using FastAPI. We created an API that ingests PDF/.txt files, retrieves relevant information, and generates grounded answers. The deployment step is what makes GenAI systems, and traditional ML systems, easy to access in real-world applications. We can further improve our RAG by optimizing the chunking strategy and combining different retrieval methods for our queries.
Frequently Asked Questions
--reload makes the FastAPI server auto-restart whenever the code changes, reflecting updates without manually restarting the server.
We use POST because queries include structured data like JSON objects, which can be large and complex, unlike GET requests, which are used for simple retrievals.
MMR (Maximal Marginal Relevance) balances relevance and diversity when selecting document chunks, ensuring retrieved results are useful without being redundant.
Increasing top_k retrieves more chunks for the LLM, which can introduce noise into the generated answers due to the presence of irrelevant content.
