Natural language processing has become a hot topic recently, and we find working with such technologies very exciting. We decided to build a summarization tool that returns a summary of the content at a given URL along with its top image and top video, which gave birth to our service, the 'PithPicker' Engine. Generally, a summary can be produced either by rewriting the original text or by extracting key sentences from it. The second approach works better in most cases, so it is our preferred method.
So what makes PithPicker stand out in the NLP world? Read on for a brief explanation. The engine works in two steps:
- Extracting the pivotal text from the given URLs
- Summarizing the given paragraph
Extracting text from the given URLs
We have a script that takes web URLs and fetches the full HTML content of each page. The HTML is then passed through a template removal process so that we keep only the main article and exclude headers, footers, advertisements, and sidebars. The result is fed to a DOM parser that extracts the article data: meta description, title, actual content, top image, and top video. The actual content is then fed to the summarization tool as described below.
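To make the extraction step more concrete, here is a minimal sketch using only the Python standard library. It is not the PithPicker implementation: the names `TEMPLATE_TAGS`, `ArticleExtractor`, `fetch_html`, and `extract_article` are hypothetical, and treating `header`, `footer`, `nav`, and `aside` as "template" and the first `video` or `iframe` as the top video are simplifying assumptions.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Assumption: these tags hold page template, not article body.
TEMPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class ArticleExtractor(HTMLParser):
    """Collects title, meta description, paragraphs, first image and video."""

    def __init__(self):
        super().__init__()
        self.article = {"title": "", "description": "", "content": "",
                        "top_image": None, "top_video": None}
        self._skip_depth = 0        # > 0 while inside a template tag
        self._in_title = False
        self._in_paragraph = False
        self._paragraphs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in TEMPLATE_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.article["description"] = attrs.get("content") or ""
        elif self._skip_depth == 0:
            if tag == "p":
                self._in_paragraph = True
                self._paragraphs.append("")
            elif tag == "img" and self.article["top_image"] is None:
                self.article["top_image"] = attrs.get("src")
            elif tag in ("video", "iframe") and self.article["top_video"] is None:
                self.article["top_video"] = attrs.get("src")

    def handle_endtag(self, tag):
        if tag in TEMPLATE_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.article["title"] += data
        elif self._in_paragraph and self._skip_depth == 0:
            self._paragraphs[-1] += data


def fetch_html(url):
    """Download a page and decode it leniently as UTF-8."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_article(html):
    """Return the article data dictionary for one HTML document."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    parser.article["title"] = parser.article["title"].strip()
    parser.article["content"] = "\n".join(
        p.strip() for p in parser._paragraphs if p.strip())
    return parser.article
```

Calling `extract_article(fetch_html(url))` would give a dictionary whose `content` and `title` fields can be handed to the summarizer. Real pages need a far more robust template removal step than a fixed tag list.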
Summarizing the text
The content and title from the extraction step are used here as input. We use the following main criteria to identify the top 5 sentences:
- Title words' presence
- Length of the sentences
- Sentence position
- Keywords' presence in sentences and their intersection
- Frequency of occurrence
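The criteria above can be sketched as a simple extractive scorer. This is only an illustration under stated assumptions, not our production algorithm: the equal weighting of the five scores, the `ideal_length` of 20 words, the tiny stopword list, the choice of the 10 most frequent words as keywords, and all function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in", "on",
             "for", "with", "is", "are", "was", "were", "it", "this", "that",
             "as", "at", "by", "be", "from"}


def split_sentences(text):
    """Naive split on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower())
            if w not in STOPWORDS]


def summarize(title, content, top_n=5, ideal_length=20):
    """Return the top_n highest scoring sentences in document order."""
    sentences = split_sentences(content)
    title_words = set(tokenize(title))
    freq = Counter(tokenize(content))
    keywords = {word for word, _ in freq.most_common(10)}
    max_freq = max(freq.values(), default=1)

    scored = []
    for index, sentence in enumerate(sentences):
        words = tokenize(sentence)
        if not words:
            continue
        unique = set(words)
        # 1. Title words' presence
        title_score = (len(unique & title_words) / len(title_words)
                       if title_words else 0.0)
        # 2. Sentence length, best when close to ideal_length words
        length_score = max(0.0, 1 - abs(ideal_length - len(words)) / ideal_length)
        # 3. Sentence position, earlier sentences score higher
        position_score = 1 - index / len(sentences)
        # 4. Keyword presence: overlap between sentence words and keywords
        keyword_score = len(unique & keywords) / len(keywords) if keywords else 0.0
        # 5. Frequency of occurrence of the sentence's words in the article
        frequency_score = sum(freq[w] for w in words) / (len(words) * max_freq)
        total = (title_score + length_score + position_score
                 + keyword_score + frequency_score) / 5
        scored.append((total, index, sentence))

    best = sorted(scored, key=lambda item: item[0], reverse=True)[:top_n]
    return [sentence for _, _, sentence in sorted(best, key=lambda item: item[1])]
```

Returning the selected sentences in their original order keeps the summary readable, since extracted sentences often depend on what came before them.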
We then wrote a multiprocessing script that runs the pipeline in parallel, so we can handle multiple summarization requests at the same time.
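A rough sketch of how such parallel processing could look with Python's `multiprocessing.Pool` follows. The function names are hypothetical, `summarize_url` is only a placeholder for the real extract-and-summarize pipeline, and catching errors per URL is a design assumption so that one bad page does not fail the whole batch.

```python
from multiprocessing import Pool


def summarize_url(url):
    """Hypothetical placeholder for the real per-URL pipeline."""
    return {"url": url, "summary": "summary of " + url}


def _safe_call(job):
    """Run one job and turn any exception into an error record."""
    worker, url = job
    try:
        return worker(url)
    except Exception as exc:
        return {"url": url, "error": str(exc)}


def summarize_batch(urls, worker=summarize_url, processes=4):
    """Process URLs in parallel; results keep the input order."""
    jobs = [(worker, url) for url in urls]
    # Run inline for tiny batches or when parallelism is disabled.
    if processes <= 1 or len(jobs) <= 1:
        return [_safe_call(job) for job in jobs]
    with Pool(processes=processes) as pool:
        return pool.map(_safe_call, jobs)
```

When `processes` is greater than 1, the worker must be a top-level function so it can be pickled, and the call should sit under an `if __name__ == "__main__":` guard on platforms that start worker processes by spawning.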
Mashape
We recently hosted our tool on the Mashape API Marketplace. Mashape is essentially an API gateway where developers can host their APIs and other users can consume them. You can find our API listed on Mashape.
It can take a batch of URLs as input and return results.
Accuracy
To gauge accuracy, we took a random sample of URLs. Our summarization script was successful on around 85% of them, while a few of the URLs were not actually news articles. Top image and video fetching worked well in most cases. Note that our algorithm focuses on news articles and works well for them. I will soon add more technical details on using the PithPicker Engine API in my next blog post. Meanwhile, you can try our service on Mashape, where processing up to 100 URLs per day is free!
Applications
Some real-world applications of PithPicker include, but are not limited to, summarizing content from web pages, emails, documents, and knowledge management archives, given a proper wrapper.