Skip to content

extractnetvsextruct

MIT 3 9 70
330 (month) Dec 11 2020 2.0.7 (2 months ago)
728 10 43 BSD
0.14.0 (2 months ago) Oct 27 2015 48.7 thousand (month)

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

  • W3C's HTML Microdata
  • embedded JSON-LD
  • Microformat via mf2py
  • Facebook's Open Graph
  • (experimental) RDFa via rdflib
  • Dublin Core Metadata (DC-HTML-2003)

Example Use


from extractNet.extractNet import extractNet

#Initialize the model
en = extractNet()

#Extract structured data from text
text = "My phone number is 555-555-5555 and my email address is example@example.com"
data = en.extract(text)

#Print the extracted data
print(data)
{'phone_number': '555-555-5555', 'email': 'example@example.com'}
# retrieve HTML content
import httpx

response = httpx.get('https://webscraping.fyi/lib/python/extruct')

import extruct

all_data = extruct.extract(response.text, response.url)

# or we can extract specific metadata format by importing individuals extractors:


extractor = extruct.MicrodataExtractor()
microdata = extractor.extract(response.text)

extractor = extruct.JsonLdExtractor()
jsonld = extractor.extract(response.text) 

Alternatives / Similar