extruct vs embed
extruct is a library for extracting embedded metadata from HTML markup.
Currently, extruct supports:
- W3C's HTML Microdata
- embedded JSON-LD
- Microformat via mf2py
- Facebook's Open Graph
- (experimental) RDFa via rdflib
- Dublin Core Metadata (DC-HTML-2003)
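To make one of the formats above concrete: embedded JSON-LD lives in `<script type="application/ld+json">` elements. The following is a minimal sketch of that idea using only the Python standard library. It is not how extruct is implemented, and the names `JsonLdSketchParser`, `extract_jsonld_sketch`, and `SAMPLE_HTML` are hypothetical, chosen for illustration only.

```python
import json
from html.parser import HTMLParser


class JsonLdSketchParser(HTMLParser):
    """Collects parsed JSON from <script type="application/ld+json"> blocks.

    Hypothetical helper for illustration; not part of extruct.
    """

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script element opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # On the closing tag, decode the buffered text as JSON.
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            text = "".join(self._buffer).strip()
            if text:
                self.items.append(json.loads(text))


def extract_jsonld_sketch(html):
    """Return a list with one decoded object per JSON-LD script block."""
    parser = JsonLdSketchParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Illustrative sample page, not taken from a real site.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}
</script>
</head><body></body></html>
"""

jsonld_items = extract_jsonld_sketch(SAMPLE_HTML)
print(jsonld_items)
```

A real extractor also has to cope with malformed JSON, comments inside the script element, and character encodings, which this sketch deliberately ignores.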
Embed is a PHP library for getting information from any web page (using oEmbed, Open Graph, Twitter Cards, scraping the HTML, etc.). It is compatible with any web service (YouTube, Vimeo, Flickr, Instagram, etc.) and has adapters for some sites (archive.org, GitHub, Facebook, etc.).
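Two of the sources mentioned above, Open Graph and Twitter Cards, are plain `<meta>` tags in the page head. As a rough illustration of what reading them involves, here is a small standard-library Python sketch. It is not Embed's implementation, and `MetaTagSketchParser`, `read_meta_tags_sketch`, and the sample markup are assumptions made up for this example.

```python
from html.parser import HTMLParser


class MetaTagSketchParser(HTMLParser):
    """Collects og:* and twitter:* values from <meta> tags.

    Hypothetical helper for illustration; not part of Embed or extruct.
    """

    def __init__(self):
        super().__init__()
        self.open_graph = {}
        self.twitter = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        content = attr.get("content")
        if content is None:
            return
        # Open Graph normally uses "property"; Twitter Cards normally use "name".
        key = attr.get("property") or attr.get("name") or ""
        if key.startswith("og:"):
            # Keep the first value seen for each key.
            self.open_graph.setdefault(key[len("og:"):], content)
        elif key.startswith("twitter:"):
            self.twitter.setdefault(key[len("twitter:"):], content)


def read_meta_tags_sketch(html):
    """Return (open_graph, twitter) dictionaries found in the markup."""
    parser = MetaTagSketchParser()
    parser.feed(html)
    parser.close()
    return parser.open_graph, parser.twitter


# Illustrative sample markup, not taken from a real site.
SAMPLE_HEAD = """
<head>
<meta property="og:title" content="Example video">
<meta property="og:type" content="video.other">
<meta name="twitter:card" content="player">
<meta name="description" content="Plain description">
</head>
"""

og_tags, twitter_tags = read_meta_tags_sketch(SAMPLE_HEAD)
print(og_tags)
print(twitter_tags)
```

A library that merges several such sources still has to decide which one wins when they disagree, for example when the Open Graph title differs from the HTML `<title>`. This sketch simply keeps the first value per key and makes no attempt at that kind of reconciliation.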
Example Use
# retrieve HTML content
import httpx
response = httpx.get('https://webscraping.fyi/lib/python/extruct')
# extract every supported metadata format at once
import extruct
# response.url is an httpx.URL object, so convert it to str for base_url
all_data = extruct.extract(response.text, base_url=str(response.url))
# or extract a specific metadata format by importing individual extractors:
from extruct.w3cmicrodata import MicrodataExtractor
from extruct.jsonld import JsonLdExtractor
extractor = MicrodataExtractor()
microdata = extractor.extract(response.text)
extractor = JsonLdExtractor()
jsonld = extractor.extract(response.text)
use Embed\Embed;
$embed = new Embed();
//Load any url:
$info = $embed->get('https://www.youtube.com/watch?v=PP1xn5wHtxE');
//Get content info
$info->title; //The page title
$info->description; //The page description
$info->url; //The canonical url
$info->keywords; //The page keywords
$info->image; //The thumbnail or main image
$info->code->html; //The code to embed the image, video, etc
$info->code->width; //The exact width of the embed code (if exists)
$info->code->height; //The exact height of the embed code (if exists)
$info->code->ratio; //The aspect ratio (width/height)
$info->authorName; //The resource author
$info->authorUrl; //The author url
$info->cms; //The cms used
$info->language; //The language of the page
$info->languages; //The alternative languages
$info->providerName; //The provider name of the page (Youtube, Twitter, Instagram, etc)
$info->providerUrl; //The provider url
$info->icon; //The big icon of the site
$info->favicon; //The favicon of the site (an .ico file or a png with up to 32x32px)
$info->publishedTime; //The published time of the resource
$info->license; //The license url of the resource
$info->feeds; //The RSS/Atom feeds