Natural language processing has become a hot topic recently, and we find working with such technologies very exciting. We decided to come up with a summarization tool that returns a summary of a given URL along with the top image and top video from its content, and so our service, the 'PithPicker' Engine, was born. In general, a summary can be produced either by rewriting the original text or by extracting key sentences from it. The second approach works better in most cases, so it is our preferred method.
So what makes PithPicker stand out in the so-called NLP world? Read on for a brief explanation. The engine works in two steps:
- Extracting the pivotal text from the given URLs
- Summarizing the given paragraph
Extracting text from the given URLs
Our script takes a web URL and fetches the full page. The HTML is then passed through a template removal process that strips headers, footers, advertisements and sidebars, leaving the main article. The result is fed to a DOM parser that extracts the article data: meta description, title, actual content, top image and top video. The actual content is then passed to the summarization tool described below.
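To make the extraction step concrete, here is a minimal sketch using only the Python standard library. It is not the PithPicker implementation: the names `ArticleParser`, `extract_article`, `fetch_html` and `TEMPLATE_TAGS` are hypothetical, template removal is approximated by skipping a fixed set of tags, the top image is assumed to come from an `og:image` meta tag, and top video detection is left out.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Tags treated as page template rather than article body. This fixed set is
# an assumption of the sketch; real template removal is more involved.
TEMPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class ArticleParser(HTMLParser):
    """Collects title, meta description, top image and paragraph text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.top_image = ""
        self.paragraphs = []
        self._skip_depth = 0  # > 0 while inside a template tag
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in TEMPLATE_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attrs.get("content") or ""
            if attrs.get("name") == "description":
                self.description = content
            elif attrs.get("property") == "og:image":
                self.top_image = content
        elif tag == "p" and self._skip_depth == 0:
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in TEMPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph and self._skip_depth == 0:
            self.paragraphs[-1] += data


def fetch_html(url, timeout=10):
    """Download a page and decode it, falling back to UTF-8."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_article(html):
    """Return the article data found in an HTML string as a dict."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    # Normalize whitespace inside each paragraph, then join the paragraphs.
    content = " ".join(" ".join(p.split()) for p in parser.paragraphs if p.strip())
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "top_image": parser.top_image,
        "content": content,
    }
```

Calling `extract_article(fetch_html(url))` would then give a dictionary whose `title` and `content` fields are the input for summarization.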
Summarizing the text
The content and title from the extraction step are the input here. We use the following main criteria to identify the top 5 sentences:
- Title words' presence
- Length of the sentences
- Sentence position
- Keywords' presence in sentences and their intersection
- Frequency of occurrence
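The criteria above can be sketched as a simple extractive scorer. This is an illustration under stated assumptions, not the actual PithPicker scoring: the equal weighting of the five scores, the `IDEAL_LENGTH` of 20 words, the small `STOPWORDS` set, the choice of the 10 most frequent words as keywords, and the function names are all hypothetical, and the keyword criterion is reduced to a plain overlap between a sentence and the keyword set.

```python
import re
from collections import Counter

# Assumed values for this sketch only.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "for", "in", "is", "it",
             "of", "on", "the", "to", "was", "with"}
IDEAL_LENGTH = 20  # target sentence length in non-stopword tokens


def words(text):
    """Lowercase tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(title, content, top_n=5):
    """Return the top_n highest-scoring sentences in their original order."""
    sentences = split_sentences(content)
    if not sentences:
        return []
    title_words = set(words(title))
    freq = Counter(words(content))
    keywords = {w for w, _ in freq.most_common(10)}
    total = sum(freq.values()) or 1

    scored = []
    for index, sentence in enumerate(sentences):
        tokens = words(sentence)
        unique = set(tokens)
        # Share of title words that appear in the sentence.
        title_score = len(unique & title_words) / len(title_words) if title_words else 0.0
        # Closeness of the sentence length to the assumed ideal length.
        length_score = max(0.0, 1 - abs(IDEAL_LENGTH - len(tokens)) / IDEAL_LENGTH)
        # Earlier sentences score higher.
        position_score = 1 - index / len(sentences)
        # Overlap between the sentence and the document keywords.
        keyword_score = len(unique & keywords) / len(keywords) if keywords else 0.0
        # Document frequency mass covered by the sentence's distinct words.
        frequency_score = sum(freq[t] for t in unique) / total
        score = title_score + length_score + position_score + keyword_score + frequency_score
        scored.append((score, index, sentence))

    best = sorted(scored, key=lambda item: item[0], reverse=True)[:top_n]
    return [sentence for _, _, sentence in sorted(best, key=lambda item: item[1])]
```

Returning the chosen sentences in document order, rather than in score order, keeps the summary readable as a short version of the article.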
We then wrote a multiprocessing script that processes requests in parallel, so we can handle multiple summarization requests at once.
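A minimal sketch of this kind of parallel processing with Python's `multiprocessing.Pool` follows. The names `summarize_one` and `summarize_batch` are hypothetical, and the worker is a stand-in that simply returns the first two sentences; a real worker would run the full extraction and scoring steps.

```python
from multiprocessing import Pool


def summarize_one(document):
    """Stand-in worker: return the title and the first two sentences."""
    title, content = document
    sentences = [s.strip() for s in content.split(".") if s.strip()]
    return title, [s + "." for s in sentences[:2]]


def summarize_batch(documents, workers=4):
    """Summarize (title, content) pairs, spreading work over processes."""
    # Skip the pool for trivial cases to avoid process start-up overhead.
    if workers <= 1 or len(documents) <= 1:
        return [summarize_one(d) for d in documents]
    with Pool(processes=workers) as pool:
        # map() preserves input order in the returned list.
        return pool.map(summarize_one, documents)
```

When using more than one worker, call `summarize_batch` from under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that start workers by spawning a fresh interpreter.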
We recently hosted our tool on the Mashape API Marketplace. Mashape is essentially an API gateway where developers can host their APIs and other users can consume them. Our API is listed there.
It can take a batch of URLs as input and return the results.
As for accuracy, we tested a random sample of URLs, and our summarization script was successful on around 85% of them, while a few of the URLs were not actually news articles. The top image and video fetching worked well in most cases. Note that our algorithm focuses on news articles and works well for them. I will add more technical details on using the PithPicker Engine API in my next blog post. Meanwhile, you can try our service on Mashape, where processing up to 100 URLs per day is free!
Real-world applications of PithPicker include, but are not limited to, summarizing content from web pages, emails, documents and knowledge management archives, given a proper wrapper.