
extruct

728 10 43 BSD
0.14.0 (2 months ago) Oct 27 2015 48.7 thousand (month)

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

  • W3C's HTML Microdata
  • embedded JSON-LD
  • Microformats via mf2py
  • Facebook's Open Graph
  • (experimental) RDFa via rdflib
  • Dublin Core Metadata (DC-HTML-2003)
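To make "embedded JSON-LD" from the list above concrete, here is a minimal standard-library sketch of the idea: find `<script type="application/ld+json">` elements and parse their contents as JSON. This is not extruct's implementation, and the names `JsonLdScriptParser`, `extract_jsonld`, and `SAMPLE_HTML` are hypothetical, introduced only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptParser(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # The type attribute may be missing or valueless, hence the `or ""`.
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            text = "".join(self._buffer).strip()
            if text:
                self.items.append(json.loads(text))


def extract_jsonld(html):
    """Return a list with one parsed JSON value per JSON-LD script tag."""
    parser = JsonLdScriptParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical sample page, not taken from any real site.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body></body></html>
"""

items = extract_jsonld(SAMPLE_HTML)
print(items[0]["name"])
```

A dedicated library is still the better choice in practice, since real pages often contain malformed JSON, comments, or encoding quirks that this sketch does not handle.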

Example Use


# retrieve HTML content
import httpx
import extruct

response = httpx.get('https://webscraping.fyi/lib/python/extruct')

# extract every supported metadata format at once;
# base_url is passed as a string, so convert httpx's URL object
all_data = extruct.extract(response.text, base_url=str(response.url))

# or extract a specific metadata format by importing individual extractors:
from extruct.w3cmicrodata import MicrodataExtractor
from extruct.jsonld import JsonLdExtractor

extractor = MicrodataExtractor()
microdata = extractor.extract(response.text)

extractor = JsonLdExtractor()
jsonld = extractor.extract(response.text)
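The combined result (`all_data` above) is a dictionary keyed by syntax name, such as `"microdata"`, `"json-ld"`, and `"opengraph"`, with a list of extracted items under each key. The sketch below shows one way to inspect such a result. The helpers `count_items` and `jsonld_items_of_type` are hypothetical, and `sample` is a hand-written stand-in with only three keys rather than real extraction output.

```python
def count_items(extracted):
    """Return {syntax_name: number_of_items} for an extract()-style result."""
    return {syntax: len(items) for syntax, items in extracted.items()}


def jsonld_items_of_type(extracted, type_name):
    """Return JSON-LD items whose "@type" equals, or includes, type_name."""
    matches = []
    for item in extracted.get("json-ld", []):
        item_type = item.get("@type")
        # "@type" may be a single string or a list of strings.
        types = item_type if isinstance(item_type, list) else [item_type]
        if type_name in types:
            matches.append(item)
    return matches


# Hand-written stand-in for the result of extruct.extract(); not real output.
sample = {
    "microdata": [],
    "json-ld": [
        {"@context": "https://schema.org", "@type": "Product",
         "name": "Example Widget"},
        {"@context": "https://schema.org", "@type": ["Organization", "Brand"],
         "name": "Example Co"},
    ],
    "opengraph": [],
}

print(count_items(sample))
print(jsonld_items_of_type(sample, "Product"))
```

This assumes every JSON-LD item is a flat dictionary; documents that nest their entities under `@graph` would need an extra unwrapping step.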
