NewLeaf/extractors/captions.py

import re
import requests
from extractors.video import extract_video

def extract_captions(id, **kwargs):
	captions = extract_captions_from_video(id)
	return extract_captions_from_dict(captions, **kwargs)

# Return the caption track matching `lang` or `label`.
# If neither is given, return the full captions dict unchanged.
def extract_captions_from_dict(captions, *, lang=None, label=None):
	if lang is None and label is None:
		return captions

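	# Pick the first track whose language code or label matches.
	# next() has no default here, so StopIteration is raised if nothing matches.
	url = next(
		caption["second__remoteUrl"]
		for caption in captions["captions"]
		if caption["languageCode"] == lang or caption["label"] == label
	)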
	r = requests.get(url)
	r.raise_for_status()
	# Strip the extraneous " align:start position:0%" cue settings from the timestamp lines of auto-generated captions.
	if (lang and "auto-generated" in lang) or (label and "auto-generated" in label):
		return re.sub(r"^([0-9:.]+ --> [0-9:.]+).*$", r"\1", r.content.decode("utf8"), flags=re.MULTILINE)
	# Regular captions: return the requests.Response itself (not a str), unmodified.
	return r

def extract_captions_from_video(id):
	return {
		"captions": extract_video(id)["captions"]
	}
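
The two steps inside `extract_captions_from_dict` (choosing a track, then cleaning the timestamp lines of auto-generated captions) can be tried without any network access. The following is a minimal sketch, not the module's own code: the helper names `find_track_url` and `strip_cue_settings`, the sample caption list, and the `example.invalid` URLs are all assumptions made for illustration. Unlike the original, the lookup here returns `None` instead of raising `StopIteration` when no track matches.

```python
import re

# Hypothetical caption list shaped like the dict used in captions.py.
# The URLs are placeholders, not real endpoints.
CAPTIONS = {
    "captions": [
        {
            "label": "French",
            "languageCode": "fr",
            "second__remoteUrl": "https://example.invalid/fr.vtt",
        },
        {
            "label": "English (auto-generated)",
            "languageCode": "en",
            "second__remoteUrl": "https://example.invalid/en-auto.vtt",
        },
    ]
}

# Hypothetical sample resembling an auto-generated WebVTT track.
SAMPLE_VTT = (
    "WEBVTT\n"
    "\n"
    "00:00:00.000 --> 00:00:02.500 align:start position:0%\n"
    "hello world\n"
    "\n"
    "00:00:02.500 --> 00:00:05.000 align:start position:0%\n"
    "second cue\n"
)


def find_track_url(captions, *, lang=None, label=None):
    """Return the URL of the first track matching lang or label, else None."""
    return next(
        (
            caption["second__remoteUrl"]
            for caption in captions["captions"]
            if caption["languageCode"] == lang or caption["label"] == label
        ),
        None,  # assumption: a None default instead of letting StopIteration escape
    )


def strip_cue_settings(vtt_text):
    """Drop everything after the end timestamp on each 'start --> end' line."""
    return re.sub(
        r"^([0-9:.]+ --> [0-9:.]+).*$", r"\1", vtt_text, flags=re.MULTILINE
    )


url = find_track_url(CAPTIONS, label="English (auto-generated)")
cleaned = strip_cue_settings(SAMPLE_VTT)
print(url)
print(cleaned)
```

The regular expression is the same one used in the module. It only touches lines that begin with a `start --> end` timestamp pair, so the `WEBVTT` header and the cue text pass through unchanged.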

Implement captions. Automatic subtitles are not supported, because youtube_dlc does not provide them. 2021-01-17 22:59:14 +00:00
Captions: Python code cleanup and optimisation. 2021-01-20 04:35:13 +00:00
Support auto-generated captions. The caption extraction is now entirely in our own hands. 2021-04-04 13:23:54 +00:00
Remove extraneous " align:start position:0%" on auto-generated captions. 2021-04-05 00:06:45 +00:00
Fix regular captions. This removes all of the code that was previously used to get them from /timedtext, and instead always uses whatever is extracted from the video page. This unfortunately now requires a whole video fetch just for the captions. But assuming captions are only requested by a frontend, this won't be a problem due to the memory cache: the captions link will be in memory because the just-requested video is in memory too. 2021-11-20 07:40:34 +00:00
Remove `with requests` when it is unnecessary. 2022-01-16 08:51:26 +00:00