extractnet vs readability

extractnet: MIT license, 3, 9, 70 stars, 330 downloads (month), Dec 11 2020, version 2.0.7 (2 months ago)

readability: Apache License 2.0, 35, 4, 2,206 stars, 57.6 thousand downloads (month), Jun 30 2011, version 0.8.1 (2 years ago)

python-readability is a Python package that extracts the main content of a web page, removing unwanted elements such as ads, navigation, and sidebars.

It is based on the algorithm used by the web-based service Readability, and it uses the lxml package to parse the HTML and extract the main content.

Readability is similar to Newspaper in that both extract the main content from HTML pages.

Example Use

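# extractnet example
# (import path and class name follow the extractnet README; treat them as an
# assumption and check them against the version you have installed)
import requests
from extractnet import Extractor

# Download a page; extractnet works on raw HTML, not on plain text
raw_html = requests.get('http://example.com').text

# Extract the main content and metadata from the HTML
results = Extractor().extract(raw_html)

# Print the extracted data (a dict of extracted fields)
print(results)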

# readability example
import requests
from readability import Document

response = requests.get('http://example.com')
doc = Document(response.text)

# Title of the page
doc.title()
'Example Domain'

# Main content of the page as cleaned-up HTML
doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""
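The summary returned by readability is still HTML. If plain text is needed, one option is to strip the tags with Python's standard library. The following is a minimal sketch, not part of readability itself: the names TextCollector and summary_to_text are hypothetical helpers, and the sample string is a shortened stand-in for a real doc.summary() result.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Skip whitespace-only chunks and collapse inner runs of whitespace
        stripped = data.strip()
        if stripped:
            self.parts.append(" ".join(stripped.split()))


def summary_to_text(summary_html):
    """Return the text content of an HTML summary, one text chunk per line."""
    collector = TextCollector()
    collector.feed(summary_html)
    collector.close()
    return "\n".join(collector.parts)


# Shortened stand-in for the HTML that doc.summary() returns
sample_summary = (
    '<html><body><div><body id="readabilityBody">\n<div>\n'
    '    <h1>Example Domain</h1>\n'
    '    <p>This domain is established to be used for illustrative examples.</p>\n'
    '</div>\n</body>\n</div></body></html>'
)

print(summary_to_text(sample_summary))
```

Note that text split by inline tags, such as a link inside a paragraph, ends up on separate lines with this approach. For anything beyond a quick look at the content, a dedicated converter such as html2text from the list below is likely a better fit.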

Alternatives / Similar

html2text (1,349 stars)

newspaper (12,365 stars)

trafilatura (722 stars)

gofeed (2,028 stars)

extruct (728 stars)

sumy (3,007 stars)

photon (9,316 stars)

extractnet (70 stars)