extractnet vs extruct
extruct is a library for extracting embedded metadata from HTML markup.
Currently, extruct supports:
- W3C's HTML Microdata
- embedded JSON-LD
- Microformats via mf2py
- Facebook's Open Graph
- (experimental) RDFa via rdflib
- Dublin Core Metadata (DC-HTML-2003)
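To make "embedded JSON-LD" from the list above concrete, here is a minimal standard-library sketch that collects the contents of `<script type="application/ld+json">` tags. It is an illustration only and not how extruct implements its extractor; `JsonLdScriptParser`, `extract_jsonld` and `SAMPLE_HTML` are hypothetical names made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptParser(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only when the script tag declares JSON-LD content
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # On the closing tag, parse everything buffered as one JSON document
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


def extract_jsonld(html):
    """Return a list with one parsed object per JSON-LD script block."""
    parser = JsonLdScriptParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical sample page used only for this illustration
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body><p>Hello</p></body></html>
"""

print(extract_jsonld(SAMPLE_HTML))
```

A dedicated library is still the better choice on real pages, since this sketch does no error handling for malformed JSON and ignores every other metadata syntax.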
Example Use
from extractNet.extractNet import extractNet
# Initialize the model
en = extractNet()
# Extract structured data from text
text = "My phone number is 555-555-5555 and my email address is example@example.com"
data = en.extract(text)
# Print the extracted data
print(data)
# Output: {'phone_number': '555-555-5555', 'email': 'example@example.com'}
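# retrieve HTML content
import httpx
import extruct
response = httpx.get('https://webscraping.fyi/lib/python/extruct')
# extract every supported metadata format at once;
# base_url should be a string, and httpx exposes response.url as an httpx.URL object
all_data = extruct.extract(response.text, base_url=str(response.url))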
# or we can extract a specific metadata format by importing individual extractors:
from extruct.w3cmicrodata import MicrodataExtractor
from extruct.jsonld import JsonLdExtractor
extractor = MicrodataExtractor()
microdata = extractor.extract(response.text)
extractor = JsonLdExtractor()
jsonld = extractor.extract(response.text)
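Once all formats have been extracted, it can help to see which syntaxes actually produced anything. The sketch below assumes the combined result is a dict keyed by syntax name with a list of items per key, which is the shape the extruct README describes. `summarize_metadata` is a hypothetical helper and `sample` is made-up data, not output from a real page.

```python
def summarize_metadata(all_data):
    """Return {syntax: item_count} for every syntax that produced items."""
    return {syntax: len(items) for syntax, items in all_data.items() if items}


# Hypothetical data shaped like a combined extraction result (assumption)
sample = {
    "microdata": [],
    "json-ld": [{"@context": "https://schema.org", "@type": "WebPage"}],
    "opengraph": [{"properties": [("og:title", "Example")]}],
    "microformat": [],
    "rdfa": [],
    "dublincore": [],
}

print(summarize_metadata(sample))
```

Filtering out empty lists keeps the summary short on pages that only use one or two metadata formats.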