
Auto summarizing news articles using Natural Language Processing (NLP)

Natural language processing has become a hot topic recently, and we find working with such technologies very exciting. We decided to come up with a summarization tool that returns a summary of a given URL along with the top image and top video from its content, which gave birth to our service, the 'PithPicker' Engine. Generally, a summary can be produced either by rewriting the original text or by extracting key sentences from it. The second approach works better in most cases, so it is our preferred method.

So how does PithPicker stand out in the so-called NLP world? To explain briefly, it works in two steps:

  1. Extracting the pivotal text from the given URLs
  2. Summarizing the given paragraph

Extracting text from the given URLs

We have a script that takes web URLs and fetches the full page content. The HTML is then passed through a template removal process that excludes headers, footers, advertisements and sidebars, leaving the main article. The result is fed to a DOM parser that extracts the article data: meta description, title, actual content, top image and top video. The actual content is then fed to the summarization tool as described below.
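The extraction step can be illustrated with a minimal sketch. It is not the PithPicker implementation: it uses only Python's standard `html.parser`, and it approximates template removal by ignoring anything inside a fixed set of tags. The names `ArticleExtractor`, `extract_article` and `SKIP_TAGS` are hypothetical, and treating `iframe` as a video is an assumption.

```python
from html.parser import HTMLParser

# Assumption: anything inside these tags counts as page template, not article.
SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class ArticleExtractor(HTMLParser):
    """Collects title, meta description, paragraphs, first image and first video."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.top_image = None
        self.top_video = None
        self.paragraphs = []
        self._skip_depth = 0      # > 0 while inside a template tag
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content") or ""
        elif self._skip_depth == 0:
            if tag == "p":
                self._in_paragraph = True
                self.paragraphs.append("")
            elif tag == "img" and self.top_image is None:
                self.top_image = attrs.get("src")
            elif tag in ("video", "iframe") and self.top_video is None:
                self.top_video = attrs.get("src")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph and self._skip_depth == 0:
            self.paragraphs[-1] += data


def extract_article(html):
    """Return the article data as a dict; 'content' joins the kept paragraphs."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    content = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {
        "title": parser.title.strip(),
        "description": parser.description.strip(),
        "content": content,
        "top_image": parser.top_image,
        "top_video": parser.top_video,
    }
```

Real pages need a far more robust template removal step than a fixed tag list, but the shape of the output (title, description, content, top image, top video) matches the article data described above.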

Summarizing the text

The content and title from the extraction step are used here as input. We use the following main criteria to identify the top 5 sentences:
  • Title words' presence
  • Length of the sentences
  • Sentence position
  • Keywords' presence in sentences and their intersection
  • Frequency of occurrence
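
The criteria above can be combined into a sentence score. The following is a rough sketch, not the PithPicker algorithm itself: the equal weighting, the stopword list, the ideal sentence length of 20 words and the choice of the 10 most frequent words as keywords are all assumptions made for illustration, and keyword intersection is simplified to keyword density.

```python
import re
from collections import Counter

# Assumptions for illustration: a tiny stopword list and an ideal length.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for",
             "is", "are", "was", "it", "that", "this", "with", "as", "by", "at"}
IDEAL_LENGTH = 20  # words


def words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower())
            if w not in STOPWORDS]


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(title, text, top_n=5):
    """Return the top_n highest scoring sentences in their original order."""
    sentences = split_sentences(text)
    freq = Counter(words(text))
    keywords = {w for w, _ in freq.most_common(10)}
    title_words = set(words(title))

    scored = []
    for index, sentence in enumerate(sentences):
        tokens = words(sentence)
        if not tokens:
            continue
        # Title words' presence: share of title words found in the sentence.
        title_score = (len(title_words & set(tokens)) / len(title_words)
                       if title_words else 0.0)
        # Length: closer to the assumed ideal length scores higher.
        length_score = 1 - min(abs(IDEAL_LENGTH - len(tokens)) / IDEAL_LENGTH, 1)
        # Position: earlier sentences score higher.
        position_score = 1 - index / len(sentences)
        # Keywords' presence: share of tokens that are document keywords.
        keyword_score = sum(1 for w in tokens if w in keywords) / len(tokens)
        # Frequency: average word frequency, scaled by the most frequent word.
        frequency_score = (sum(freq[w] for w in tokens)
                           / (len(tokens) * max(freq.values())))
        score = (title_score + length_score + position_score
                 + keyword_score + frequency_score) / 5
        scored.append((score, index, sentence))

    top = sorted(scored, key=lambda item: item[0], reverse=True)[:top_n]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]
```

Returning the selected sentences in document order keeps the summary readable even though they were picked by score.
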
We then wrote a multiprocessing script so that summarization runs in parallel. This way we can handle multiple summarization requests at once.
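
A minimal sketch of parallel processing with Python's `multiprocessing.Pool` is shown here. The function names are hypothetical, and `summarize_one` is only a stand-in that keeps the first few sentences; a real worker would run the full extraction and scoring pipeline.

```python
from multiprocessing import Pool


def summarize_one(text, top_n=5):
    """Stand-in summarizer (hypothetical): keep the first top_n sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return ""
    return ". ".join(sentences[:top_n]) + "."


def summarize_batch(texts, workers=1):
    """Summarize many articles, using a process pool when workers > 1."""
    if workers <= 1:
        return [summarize_one(text) for text in texts]
    with Pool(processes=workers) as pool:
        # map() preserves input order, so results line up with texts.
        return pool.map(summarize_one, texts)
```

When calling `summarize_batch(texts, workers=4)` from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can start safely on platforms that spawn rather than fork.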

Mashape

We recently hosted our tool on the Mashape API Marketplace. Mashape is an API gateway where developers can host their APIs and other users can consume them. Our API is available there; it can take a batch of URLs as input and return results.

Accuracy

Regarding accuracy, we tested a random sample of URLs, and our summarization script succeeded on around 85% of them; a few of the URLs were not actually news articles. The top image and video fetching worked well in most cases. Note that our algorithm focuses on news articles and works well for them. I will add more technical details on PithPicker Engine API usage in my next blog post. Meanwhile, you can play with our service on Mashape, where processing of up to 100 URLs per day is free!

Applications

Some real-world applications of PithPicker include, but are not limited to, summarizing content from web pages, emails, documents and knowledge management archives, given a proper wrapper.
