PDF Analysis and Querying with Generative AI

Akash

Engineering Lead

8 min read

In today's digital age, information is abundant, and the ability to efficiently extract knowledge from vast amounts of text has become increasingly valuable. Traditional methods of manual extraction and analysis can be time-consuming and labor-intensive. However, with the advancements in Generative AI, we now have powerful tools that can automate the process of reading and understanding text, opening up new possibilities for knowledge extraction and retrieval.

In this blog, we will explore how Generative AI can be leveraged to read a PDF and create a custom knowledge base that allows us to query its contents. We will delve into the fascinating world of natural language processing and machine learning to demonstrate how these technologies can be applied to one of our favourite books, "The Book Thief" by Markus Zusak.

"The Book Thief" is a captivating novel set during World War II, narrated by Death itself (ikr). It tells the story of Liesel Meminger, a young girl living in Germany, and her extraordinary journey through the power of words and storytelling. This rich and compelling narrative will serve as the foundation for our exploration of Generative AI and its application in analyzing textual content.

So, let's embark on this journey into the realm of Generative AI and discover how it can revolutionize the way we read, analyze, and interact with PDFs, ultimately unraveling the profound beauty of "The Book Thief" and other literary works.

Prerequisite: Setting Up the Working Environment

Before we embark on our journey, let's ensure we have the necessary tools to leverage Generative AI for querying a PDF.

1. Create a Virtual Environment: We start by isolating our project dependencies in a virtual environment using the command:

    
 python -m venv venv
    
    

2. Activate the Virtual Environment: Depending on the operating system, use the appropriate command to activate the virtual environment.

    
 # Windows
 .\venv\Scripts\activate
 # macOS/Linux
 source venv/bin/activate
    
    

3. Install Required Packages: With the virtual environment active, install the necessary packages using the command:

    
 pip install openai langchain tiktoken pinecone-client pypdf
    
    

With these steps completed, our working environment is set up. The code in the following steps also expects API keys for OpenAI and Pinecone, defined here as constants:

          
 OPENAI_API_KEY = '<ADD YOUR OPENAI API KEY HERE😛>'
 PINECONE_API_KEY = '<ADD YOUR PINECONE API KEY HERE😛>'
 PINECONE_API_ENV = '<ADD YOUR API ENV HERE😛>'  # for example, us-west4-gcp-free
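
Hardcoded keys are easy to leak through version control. As a minimal sketch of one alternative, the snippet below reads the same three values from environment variables. The helper names `read_setting` and `load_settings` are our own, not part of any library, and the variable names simply mirror the constants above.

```python
import os
from typing import Dict, Mapping, Optional


def read_setting(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a required setting, failing early with a clear message if it is unset."""
    # Fall back to the real process environment unless a mapping is passed in
    source = os.environ if env is None else env
    value = source.get(name, "")
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the script")
    return value


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect the three values the rest of the walkthrough expects."""
    names = ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_API_ENV")
    return {name: read_setting(name, env) for name in names}
```

Calling `load_settings()` with no argument reads from the process environment; passing a plain dictionary instead makes the helper easy to try out without touching real keys.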
          
      

1. Loading the PDF:

To begin, we need to load the PDF into our system. We can use the PyPDFLoader module from the LangChain library to accomplish this. Here's an example of how to load the PDF:

        
 from langchain.document_loaders import PyPDFLoader

 # Load the book
 loader = PyPDFLoader("The-Book-Thief.pdf")
 pages = loader.load()
        
    

2. Splitting the Text:

PyPDFLoader returns one document per page of the PDF. Page boundaries rarely line up with units of meaning, so we re-split the text into chunks of a controlled size for processing and querying. The LangChain library provides a RecursiveCharacterTextSplitter that can handle this task. We can define the chunk size and the overlap between neighbouring chunks based on our requirements. Here's an example of splitting the book into smaller texts:

        
 from langchain.text_splitter import RecursiveCharacterTextSplitter

 text_splitter = RecursiveCharacterTextSplitter(
     separators=["\n\n", "\n", "\t"],
     chunk_size=5000,
     chunk_overlap=200,
 )
 texts = text_splitter.split_documents(pages)

 num_documents = len(texts)
 print(f"Now our book is split up into {num_documents} documents")
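
To make the roles of the chunk size and overlap concrete, here is a simplified sketch in plain Python. It is not LangChain's algorithm: RecursiveCharacterTextSplitter prefers to break at the separators it is given, whereas this hypothetical `split_with_overlap` helper cuts fixed-size windows. It only illustrates how neighbouring chunks come to share text.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the previous
    one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one. The overlap plays the same role in the real splitter: a sentence that straddles a chunk boundary still appears whole in at least one chunk.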
        
      

3. Generating Embeddings:

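To perform efficient queries on the text, we need an embedding for each chunk. An embedding is a vector of numbers that captures the semantic meaning of a piece of text, so chunks with similar meaning end up close together and can be found by comparing vectors. We can use the OpenAIEmbeddings class from LangChain for this. The snippet below only creates the embedding client; the vectors themselves are computed in the next step, when the chunks are written to the index:
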
        
 from langchain.embeddings import OpenAIEmbeddings
 embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
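
How "close" two embeddings are is commonly measured with cosine similarity. The sketch below computes it in plain Python on hand-made, three-dimensional toy vectors. These numbers are invented for illustration and are not real embeddings, which have far more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, NOT real embeddings
toy_vectors = {
    "Liesel steals a book": [0.9, 0.1, 0.0],
    "The girl takes a novel": [0.8, 0.2, 0.1],
    "Bombs fall on the street": [0.0, 0.2, 0.9],
}

query_vector = toy_vectors["Liesel steals a book"]
for sentence, vector in toy_vectors.items():
    # Higher scores mean the sentence points in a direction closer to the query
    print(f"{cosine_similarity(query_vector, vector):.3f}  {sentence}")
```

With these toy values, the two sentences about taking a book score much higher against each other than either does against the sentence about bombs, which is the behaviour a similarity search relies on.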
        
      

4. Creating a Knowledge Base Index:

To enable fast and accurate querying, we will create an index of our knowledge base using Pinecone, a vector database. Pinecone lets us store embeddings and retrieve them by similarity. Here's an example that connects to Pinecone and loads our chunks into an index:

          
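 from langchain.vectorstores import Pinecone
 import pinecone

 # Initialize Pinecone
 pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_API_ENV)
 index_name = "the-book-thief"

 # Create the index if it is missing; 1536 is the vector size of OpenAI's
 # text-embedding-ada-002 model (assumed here to be the embedding model in use)
 if index_name not in pinecone.list_indexes():
     pinecone.create_index(index_name, dimension=1536)

 # Embed each chunk and store the resulting vectors in the index
 docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)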
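
Conceptually, a vector index stores one vector per chunk and returns the stored items closest to a query vector. The toy class below sketches that idea in memory with a brute-force scan. `ToyVectorIndex` and its `upsert` and `query` methods are hypothetical names for illustration, not Pinecone's API, and a hosted service would use specialized index structures rather than comparing against every vector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class ToyVectorIndex:
    """Brute-force, in-memory stand-in for a vector index (illustration only)."""

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension
        self._vectors: Dict[str, List[float]] = {}

    def _check(self, vector: Sequence[float]) -> None:
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} values, got {len(vector)}")

    def upsert(self, item_id: str, vector: Sequence[float]) -> None:
        """Insert a vector, or overwrite the one already stored under item_id."""
        self._check(vector)
        self._vectors[item_id] = list(vector)

    def query(self, vector: Sequence[float], top_k: int = 3) -> List[Tuple[str, float]]:
        """Return up to top_k (id, score) pairs, most similar first."""
        self._check(vector)
        scored = [(item_id, _cosine(vector, stored)) for item_id, stored in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


# Toy three-dimensional vectors standing in for chunk embeddings
index = ToyVectorIndex(dimension=3)
index.upsert("chunk-train", [0.9, 0.1, 0.0])
index.upsert("chunk-library", [0.1, 0.9, 0.1])
index.upsert("chunk-bombing", [0.0, 0.2, 0.9])

best_id, best_score = index.query([1.0, 0.0, 0.0], top_k=1)[0]
print(best_id)  # chunk-train
```

The dimension check mirrors a constraint of real vector indexes: every stored vector and every query vector must have the same length, which is why the index dimension has to match the embedding model.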
          
        

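5. Querying the Knowledge Base:

With the index populated, we can wrap it in a LangChain vector store, retrieve the chunks most similar to a question, and let a language model answer from them using the RetrievalQA chain:
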
          
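 from langchain.chains import RetrievalQA
 from langchain.chat_models import ChatOpenAI

 # Connect to the index populated in the previous step
 index_name = "the-book-thief"
 text_field = "text"  # metadata field in which the chunk text is stored
 index = pinecone.Index(index_name)
 vectorstore = Pinecone(index, embeddings.embed_query, text_field)

 # Chat model that writes the final answer
 llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY)

 query = "Provide a rating and a review for this book"

 # Optional: inspect the three chunks most similar to the query
 docs = vectorstore.similarity_search(query, k=3)

 # "stuff" places all retrieved chunks into a single prompt
 qa = RetrievalQA.from_chain_type(
     llm=llm,
     chain_type="stuff",
     retriever=vectorstore.as_retriever(),
 )

 output = qa.run(query)
 print(output)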
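
The "stuff" chain type works by placing every retrieved chunk into one prompt together with the question. The sketch below shows that idea in plain Python. The template wording and the `build_stuff_prompt` helper are our own assumptions for illustration and do not reproduce LangChain's actual prompt.

```python
from typing import List

# Hypothetical template: the wording is an assumption, not LangChain's own prompt
PROMPT_TEMPLATE = (
    "Use the following context to answer the question. "
    "If the answer is not in the context, say you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(question: str, chunks: List[str]) -> str:
    """Join every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Placeholder strings standing in for chunks returned by the retriever
retrieved = [
    "Placeholder text standing in for the first retrieved chunk.",
    "Placeholder text standing in for the second retrieved chunk.",
]
prompt = build_stuff_prompt("Who is the narrator of this book?", retrieved)
print(prompt)
```

Because the model sees only the chunks that were retrieved, an answer can be wrong or incomplete when the relevant passage is not among them. That is worth keeping in mind when reading the answers that follow.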
          
        

We threw a bunch of questions at the AI, some straightforward and others philosophical. And guess what? Here are the cool answers it came up with!

Q1: Who is the author of the book thief?

The author of The Book Thief is Markus Zusak.


Q2: Who is the main character of this book?

The main character of this book is Liesel, a young German girl whose book-stealing and story-telling talents help sustain her family and the Jewish man they are hiding, as well as their neighbors.


Q3: Who is the narrator of this book?

The narrator of the book is Death.


Q4: How many times did Death see the book thief

Death saw the book thief many times.

Note: this is not right. On page 8 of the book, Death says, "I saw the book thief three times."

Q5: Describe the scenarios when the Death met the book thief

There are two scenarios mentioned in the given context when Death met the book thief.

1. The first scenario is when Death came to take away Liesel's brother and saw her for the first time. This happened at the beginning of the book when Liesel's brother died on a train journey, and Death came to collect his soul. Death noticed Liesel and was struck by her reaction to her brother's death. This was the first time Death saw Liesel, and he would see her again in the future.

2. The second scenario is when Death came to take Liesel away at the end of the book. This happened when Liesel was an old woman, and Death came to collect her soul. Before taking her away, Death gave Liesel a book that he had been carrying with him for a long time. They sat down together, and Liesel read the book while Death watched her. After Liesel finished reading, Death took her away.

Note: this is incomplete. The plane crash incident is not mentioned.


Q6: What is the first book Liesel steals?

The context does not provide information about the first book Liesel steals. It only mentions the second stolen book, which is called "The Shoulder Shrug".

Note: this is not right either. The first book Liesel steals is The Grave Digger's Handbook, which she takes from the snow after her brother's burial.


Q7: Discuss Liesel's friendship with Rudy. Does she love him in the way he loves her, or is it a child's love?

Liesel's friendship with Rudy is a significant part of the novel. They become friends due to their shared experiences of poverty and hunger, and their friendship is cemented by their stealing. Liesel and Rudy's friendship is a deep and meaningful one, and they care for each other deeply. However, it is unclear whether Liesel loves Rudy in the same way that he loves her. It is possible that Liesel has romantic feelings for Rudy, but it is also possible that their relationship is more of a child's love, based on their shared experiences and mutual affection. Regardless of the nature of their love, their friendship is a crucial part of the novel, and their bond is a testament to the power of human connection in the face of adversity.


Q8: Provide a brief summary of this book

The Book Thief is a novel by Markus Zusak that tells the story of Liesel Meminger, a young girl living in Nazi Germany during World War II. The book is narrated by Death, who tells the story of Liesel's life as she grows up in a foster home and learns to read and steal books. Liesel's love of books helps her to cope with the horrors of the war, and she eventually begins to share her books with others, including a Jewish man her foster family is hiding. The novel explores themes of love, loss, and the power of words.


Q9: What are the moral lessons of The Book Thief?

The Book Thief explores several moral lessons, including the power of words, the importance of human connection and empathy, the destructive nature of hate and prejudice, and the resilience of the human spirit in the face of adversity. It also highlights the consequences of blindly following authority and the importance of standing up for what is right, even in the face of danger. Ultimately, the novel emphasizes the importance of love, compassion, and hope in the face of tragedy and loss.


Q10: Provide a rating and a review for this book

As an AI language model, I cannot provide personal opinions or reviews on books or any other products. However, "The Book Thief" by Markus Zusak has received critical acclaim and has been widely praised for its unique narrative style and emotional depth. It has won several awards and has been a bestseller in many countries.

Github code link

Conclusion: Integrating Generative AI into PDF analysis and querying opens up exciting possibilities for automating knowledge extraction from text. Using LangChain, Pinecone, and an OpenAI chat model (through LangChain's ChatOpenAI wrapper), we have shown how "The Book Thief" by Markus Zusak can be turned into a custom knowledge base that can be queried in plain language. The answers were not always right, as the questions about Death's encounters and Liesel's first stolen book showed, but the combination of natural language processing, machine learning, and vector search makes it much easier to find and explore what a PDF contains.

Albert Einstein once said, "The human spirit must prevail over technology".

Pick up a book, embrace the crinkle of paper, and let your imagination run wild. Because sometimes, the best AI is your own literary adventure!

Written by a human and Assisted by ChatGPT
