Summary extraction of an article using experimental NLP techniques

Sai Kamal

Software Engineer

6 min read

Summary extraction is the technique of condensing a text into a shorter version that keeps the sections conveying useful information without losing the overall meaning. The aim is to turn lengthy text into a short version that is easy to understand.

Techniques and Libraries used

  • NumPy
  • nltk
  • spaCy
  • newspaper3k
  • Regular expressions

Summary extraction with the newspaper3k library

With the newspaper3k library we can collect the complete text of an article using its predefined Article class. The library has some advantages and some disadvantages.


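    >>> from newspaper import Article
    >>> url = "Enter the url"
    >>> article = Article(url)
    >>> article.download()  # fetch the page HTML
    >>> article.parse()     # extract the text, authors and publish date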

After calling article.download() and article.parse(), the complete text of the article is available as article.text.

Advantages of using newspaper3k

  • Predefined functions
  • Easy to access the complete text of an Article
  • Easy to collect the author names and publish date of an article with the authors and publish_date attributes
  • Easy to collect a summary of an article through the summary attribute, which is filled in after calling nlp()

Disadvantages of using newspaper3k

  • Cannot download the complete set of URLs from a website when the code is run multiple times
  • The summary collected using newspaper3k was not as good as expected
  • Deploying on Heroku is harder. Heroku does not install the NLTK corpora that newspaper3k relies on by default, so the required corpora have to be listed in a text file at deployment.

How newspaper3k extracts the summary

Punkt, the nltk sentence tokenizer, divides the complete text into a list of sentences. The library then picks 5 of the tokenized sentences and strings them together to form the summary.
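As a rough illustration of the "tokenize, then join 5 sentences" idea, here is a minimal sketch using only the standard library. The regex splitter is a crude stand-in for Punkt, and the helper names split_sentences and lead_summary are hypothetical; this is not newspaper3k's actual code.

```python
import re

def split_sentences(text):
    """Naive sentence splitter; a stand-in for NLTK's Punkt tokenizer."""
    # Split after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def lead_summary(text, max_sentences=5):
    """Join the first `max_sentences` sentences into one summary string."""
    return " ".join(split_sentences(text)[:max_sentences])

text = "One. Two! Three? Four. Five. Six. Seven."
summary = lead_summary(text)
print(summary)
```

A fixed sentence count ignores how informative each sentence is, which is the weakness the next approach tries to address.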

To overcome these disadvantages of newspaper3k and to collect more meaningful summaries, we use a library called spaCy.

Before applying spaCy, we use the nltk library to split the complete text into sentences and words. For this purpose, we use its tokenizer functions:

  • sent_tokenize
  • word_tokenize

Using spaCy, we remove all punctuation (the characters in string.punctuation) and unwanted words, such as stop words, which would otherwise harm the data.

.    Full stop or period      ()    Round brackets
,    Comma                    []    Square brackets
;    Semicolon                ""    Double inverted commas
?    Question mark            ...   Ellipsis
!    Exclamation mark         /     Slash
'    Apostrophe               _     Underscore
     Underline                -     Hyphen
:    Colon                    @     At sign
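The removal step could look roughly like the sketch below, which uses only the standard library rather than spaCy. The STOP_WORDS set is a tiny hypothetical list for illustration (spaCy ships a much larger one), and the function names are assumptions.

```python
import string

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def strip_punctuation(text):
    """Remove every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_unwanted_words(text, stop_words=STOP_WORDS):
    """Strip punctuation, then drop stop words (case-insensitive)."""
    words = strip_punctuation(text).split()
    return [w for w in words if w.lower() not in stop_words]

kept = remove_unwanted_words("The award, given in Madrid, is a first!")
print(kept)
```

Note that string.punctuation covers ASCII punctuation only, so typographic quotes or a single-character ellipsis would need to be handled separately.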

Instead of always taking 5 sentences as the summary, the spaCy-based approach gives each word an importance value (a step we call text normalization) and divides the importance of the words in each sentence by the sentence length. The most important sentences are then collected.
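The exact formula is not spelled out here, so the following is a sketch under stated assumptions: word importance is the word's count divided by the highest count in the text, and a sentence's score is the sum of its word weights divided by its length in words. The names word_weights and sentence_score are hypothetical.

```python
import re
from collections import Counter

def word_weights(words):
    """Normalize raw word counts by the highest count, so weights fall in (0, 1]."""
    counts = Counter(w.lower() for w in words)
    top = max(counts.values(), default=1)
    return {w: c / top for w, c in counts.items()}

def sentence_score(sentence, weights):
    """Sum the weights of the sentence's words, divided by its length in words."""
    words = re.findall(r"\w+", sentence.lower())
    if not words:
        return 0.0
    return sum(weights.get(w, 0.0) for w in words) / len(words)

sentences = ["Cats chase mice.", "Dogs chase cats.", "Birds sing."]
all_words = re.findall(r"\w+", " ".join(sentences))
weights = word_weights(all_words)
scores = [sentence_score(s, weights) for s in sentences]
print(scores)
```

Dividing by sentence length keeps long sentences from winning simply because they contain more words.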

We can then collect the summary according to these values. In our results, keeping about 0.3 of the text gave a summary that explains the article well. After collecting the summary, it is better to clean it so that it is easier to read. For the cleaning we used regular expressions, including metacharacters and special sequences.
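One way to read the 0.3 value is as the share of sentences to keep. The sketch below makes that assumption: it keeps the top 30% of sentences by score and returns them in their original order. The function name select_summary and the sample scores are hypothetical.

```python
import math

def select_summary(sentences, scores, ratio=0.3):
    """Keep the top `ratio` share of sentences by score, in original order."""
    if not sentences:
        return []
    keep = max(1, math.ceil(len(sentences) * ratio))
    # Rank sentence indices by score, highest first (ties keep earlier sentences).
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:keep])
    return [sentences[i] for i in chosen]

sentences = ["A.", "B.", "C.", "D.", "E.", "F.", "G."]
scores = [0.2, 0.9, 0.4, 0.8, 0.1, 0.7, 0.3]
picked = select_summary(sentences, scores)
print(picked)
```

Restoring the original order after ranking is a design choice that tends to keep the summary readable; the summaries in the examples further down do not follow article order, so the pipeline described here may skip that step.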

Metacharacters

[]   matches any one of the characters specified inside the brackets
^    matches at the start of the string
$    matches at the end of the string
.    matches any character except a newline
*    zero or more occurrences
+    one or more occurrences
{}   a specified number of occurrences
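Each metacharacter above can be tried with Python's re module. The sample sentence is made up for illustration.

```python
import re

text = "Delhi AQI hit 403 on Tuesday."

digit_chars = re.findall(r"[0-9]", text)          # [] : any one character from the set
starts_with = bool(re.search(r"^Delhi", text))    # ^  : match at the start of the string
ends_with = bool(re.search(r"Tuesday\.$", text))  # $  : match at the end of the string
any_char = re.findall(r"h.t", text)               # .  : any character except a newline
zero_or_more = re.findall(r"40*3", text)          # *  : zero or more "0" characters
one_or_more = re.findall(r"[0-9]+", text)         # +  : one or more digits
exact_count = re.findall(r"[0-9]{3}", text)       # {} : exactly three digits

print(digit_chars, starts_with, ends_with)
print(any_char, zero_or_more, one_or_more, exact_count)
```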

Special sequences

\d   matches a digit (0-9)
\D   matches any character that is not a digit
\w   matches a word character (a-z, A-Z, 0-9 and _)
\W   matches any character that is not a word character
\s   matches a whitespace character
\S   matches any character that is not whitespace
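A short illustration of the special sequences, again with Python's re module and made-up sample strings:

```python
import re

digits = re.findall(r"\d+", "Session 24 on December 2")   # runs of digits
non_digits = re.findall(r"\D+", "24th")                   # runs of non-digit characters
word_runs = re.findall(r"\w+", "best_village, 2021!")     # letters, digits and underscore
non_word_runs = re.findall(r"\W+", "best_village, 2021!") # everything else
split_on_space = re.split(r"\s+", "a  b\tc")              # split on any whitespace run
non_space_runs = re.findall(r"\S+", "a  b\tc")            # runs of non-whitespace

print(digits, non_digits, word_runs)
print(non_word_runs, split_on_space, non_space_runs)
```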

Techniques used for cleaning the data

  • Removing unwanted punctuation from the text
  • Removing URLs from the text
  • Removing hashtags
  • Removing extra white spaces
  • Removing mentions
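The cleaning steps listed above can be sketched as a single function. The patterns, the order of the steps, and the choice of which punctuation to keep (full stops, hyphens, apostrophes and colons, which survive in the example summaries further down) are all assumptions made for this sketch. The name clean_summary and the sample input are hypothetical.

```python
import re
import string

# Assumption: these characters are kept because they carry meaning in the summaries.
KEEP = ".-':"
UNWANTED = "".join(c for c in string.punctuation if c not in KEEP)

def clean_summary(text):
    """Remove URLs, hashtags, mentions, unwanted punctuation and extra whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # URLs (before punctuation)
    text = re.sub(r"#\w+", " ", text)                       # hashtags
    text = re.sub(r"@\w+", " ", text)                       # mentions
    text = text.translate(str.maketrans("", "", UNWANTED))  # unwanted punctuation
    return re.sub(r"\s+", " ", text).strip()                # extra whitespace

raw = ("Pochampally, known for its hand-woven Ikat saris, won!   "
       "Details at https://example.com/news #UNWTO @tourism")
cleaned = clean_summary(raw)
print(cleaned)
```

URLs, hashtags and mentions are removed before punctuation because stripping ":", "/", "#" or "@" first would make them impossible to recognize.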

A Few Examples

Article URL

Summary Extracted

This style Pochampally Ikat received a Geographical Indicator GI status in 2004 and is also known as Bhoodan Pochampally to commemorate the Bhoodan movement that was launched by Acharya Vinobha Bhave from this village on April 18 1951. The ministry of tourism said it has drafted a rural tourism policy which will not only promote tourism within our villages but also revitalise local arts and crafts and promote the rural economy. The best tourism villages by UNWTO pilot initiative aims to award those villages which are outstanding examples of rural destinations and showcases good practices in line with its specified evaluation areas. HYDERABAD: Pochampally village in Yadadri Bhuvanagiri district known for its famous hand-woven Ikat saris was on Tuesday selected as one of the best tourism villages by the United Nations World Tourism Organisation UNWTO. The award will be given at the 24th session of the UNWTO general assembly on December 2 in Madrid.


Article URL

Summary Extracted

BSP is projected to lose a significant share of its votes to both SP and BJP and finish third with around 30 seats while Congress could end up with five to eight seats not very different from the seven it won in 2017. If the projections turn out to be aurate Yogi Adityanath would become the first chief minister in Uttar Pradesh to serve two consecutive terms. The opinion poll indicated strong support for the Yogi government's hard-line approach on law and order as well as to a lesser extent for its legal route to countering 'forced' conversions.


Article URL

Summary Extracted

These include shutting down all except five thermal power plants within a 300km radius of Delhi till November 30 stopping entry of trucks in Delhi except for those carrying essential commodities keeping diesel and petrol vehicles more than 10 and 15 years old respectively in NCR off the road and banning construction and demolition activities in NCR till November 21 except for some government and infrastructure projects. This followed a meeting with NCR states earlier in the day with the focus on vehicular pollution dust pollution from construction activities and roads and emissions from thermal power plants and industrial pollution. At least 50 of the government staff across NCR will work from home and private establishments will be encouraged to do so till November 21.Among other measures announced by the commission are banning of DG sets in entire NCR except for emergency services and ensuring that all industries in NCR with gas connections are run only on gas failing which they are to be shut down. Delhi's air quality deteriorated on Tuesday again entering the severe category at 403.Of the 11 thermal power plants the five that have been allowed to function are NTPC Jhajjar Mahatma Gandhi TPS CLP Jhajjar Panipat TPS HPGCL Nabha Power Ltd TPS Rajpura and Talwandi Sabo TPS Mansa.
