I am immersed in a personal project focused on organizing a YouTube channel, and I am facing specific challenges that need addressing. On YouTube, there are three main types of content: VIDEOS, SHORTS and LIVE.
Context of content types on YouTube:
- VIDEOS: The traditional YouTube format with no duration restrictions.
- SHORTS: Short videos, up to 60 seconds, in vertical format.
- LIVE: Live broadcasts that encourage real-time interaction with the audience.
Challenge in the VIDEOS section:
In the VIDEOS section of YouTube, diversity is noticeable. It includes scheduled live stream videos and short clips that do not fit the SHORTS categorization.
- Scheduled live stream videos: Creators schedule broadcasts for specific dates and times. These videos appear in VIDEOS and not in LIVE.
- Uncategorized short clips: Videos with a duration of 60 seconds or less that do not meet specific requirements to be classified as "SHORTS" (such as vertical format, among others).
It is relevant to mention that I am using the YouTube Data API v3 to extract information efficiently.
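To make the classification problem concrete, here is a minimal sketch of how a video resource returned by `videos().list` could be sorted into rough buckets. It assumes the request includes `liveStreamingDetails` in `part` (in addition to `snippet` and `contentDetails`) and relies on the `snippet.liveBroadcastContent` field, which takes the values `live`, `upcoming`, or `none`. The function names, the bucket labels, and the 60-second threshold are illustrative assumptions, not part of the API. As far as I can tell, the Data API exposes no official "is a Short" flag, so a short duration only marks a candidate, and premieres may also carry `liveStreamingDetails`.

```python
import re

# ISO 8601 duration as found in contentDetails.duration,
# e.g. "PT1H2M3S", "PT45S", or "P1DT2H" for very long streams.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(duration):
    """Convert an ISO 8601 duration string to seconds (0 if it cannot be parsed)."""
    match = _DURATION_RE.match(duration or "")
    if not match:
        return 0
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])


def classify_video(item, short_max_seconds=60):
    """Return a rough bucket label for one video resource (labels are hypothetical)."""
    broadcast_state = item.get("snippet", {}).get("liveBroadcastContent", "none")
    if broadcast_state == "upcoming":
        return "UPCOMING_LIVE"   # scheduled stream that has not started yet
    if broadcast_state == "live":
        return "LIVE_NOW"
    if "liveStreamingDetails" in item:
        return "PAST_LIVE"       # finished stream (or possibly a premiere)
    seconds = duration_to_seconds(item.get("contentDetails", {}).get("duration"))
    if 0 < seconds <= short_max_seconds:
        return "SHORT_CANDIDATE"  # short enough, but orientation is unknown
    return "VIDEO"
```

Confirming that a `SHORT_CANDIDATE` is really a Short would need an extra signal beyond these fields, since aspect ratio is not part of the data used here.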
I would like to share that I am relatively new to this field and am in the learning process. I appreciate your patience and any guidance you can provide. If you notice any clumsiness in my approach, I would be delighted to receive advice for improvement.
Here is the relevant portion of my Python script, YouTube Channel Scraper:
import os
import re
import requests
from googleapiclient.discovery import build
from datetime import datetime, timedelta


# Function to extract YouTube channel ID from the provided link
def get_channel_id(channel_link):
    try:
        # Make a request to the provided YouTube channel link
        response = requests.get(channel_link)
        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Define a regex pattern to extract the channel ID from the XML feed link
            pattern = r"https://www\.youtube\.com/feeds/videos\.xml\?channel_id=([A-Za-z0-9_-]+)"
            # Search for the pattern in the response text
            match = re.search(pattern, response.text)
            # If a match is found, return the extracted channel ID
            if match:
                return match.group(1)
            else:
                print("No channel found in the provided link.")
        else:
            print("Could not access the link. Make sure the link is valid.")
    except requests.exceptions.RequestException as e:
        print("An error occurred while making the request:", str(e))
    except Exception as e:
        print("An error occurred:", str(e))
    return None


# Function to parse the duration string and extract hours, minutes, and seconds
def parse_duration(duration):
    duration = duration[2:]
    hours, minutes, seconds = 0, 0, 0
    if 'H' in duration:
        hours = int(duration.split('H')[0])
        duration = duration.split('H')[1]
    if 'M' in duration:
        minutes = int(duration.split('M')[0])
        duration = duration.split('M')[1]
    if 'S' in duration:
        seconds = int(duration.split('S')[0])
    return hours, minutes, seconds


# Function to save video information to a file
def save_video_info_to_file(output_file, video_info):
    title = video_info["snippet"]["title"]
    views = video_info["statistics"].get("viewCount", "N/A")
    likes = video_info["statistics"].get("likeCount", "N/A")
    upload_date = video_info["snippet"]["publishedAt"]
    hours, minutes, seconds = parse_duration(video_info["contentDetails"]["duration"])
    # Convert upload date to GMT-5 timezone, this is my time zone
    upload_datetime = datetime.fromisoformat(upload_date[:-1])
    upload_datetime_gmt5 = upload_datetime - timedelta(hours=5)
    # Adjust for videos uploaded before 5 AM GMT-5
    if upload_datetime_gmt5.hour < 5:
        upload_datetime_gmt5 -= timedelta(days=1)
    formatted_upload_date = upload_datetime_gmt5.strftime("%d/%m/%Y")
    formatted_upload_time = upload_datetime_gmt5.strftime("%H:%M:%S")
    duration_str = ""
    if hours > 0:
        duration_str += f"{hours} hour{'s' if hours > 1 else ''}"
    if minutes > 0:
        if duration_str:
            duration_str += ", "
        duration_str += f"{minutes} minute{'s' if minutes > 1 else ''}"
    if seconds > 0:
        if duration_str:
            duration_str += " and "
        duration_str += f"{seconds} second{'s' if seconds > 1 else ''}"
    # Write video information to the output file
    with open(output_file, "a", encoding="utf-8") as file:
        file.write("Title: " + title + "\n")
        file.write("Upload Date: " + formatted_upload_date + "\n")
        file.write("Upload Time: " + formatted_upload_time + "\n")
        file.write("Duration: " + duration_str + "\n")
        file.write("Views: " + str(views) + "\n")
        file.write("Likes: " + str(likes) + "\n\n\n")


# Function to get channel name and save video information to a file
def get_channel_name(channel_id):
    api_key = "[YOUR API HERE]"
    gmt_offset = -5
    # Build the YouTube API service
    youtube = build("youtube", "v3", developerKey=api_key)
    videos_info = []
    # Fetch videos information from the channel
    next_page_token = None
    while True:
        videos_response = youtube.search().list(
            part="id",
            channelId=channel_id,
            maxResults=50,
            pageToken=next_page_token
        ).execute()
        video_ids = [item["id"]["videoId"]
                     for item in videos_response.get("items", [])
                     if "videoId" in item.get("id", {})]
        # Skip the details request when a page contains no video IDs
        if video_ids:
            videos_details_response = youtube.videos().list(
                part="snippet,statistics,contentDetails",
                id=",".join(video_ids)
            ).execute()
            videos_info.extend(videos_details_response["items"])
        next_page_token = videos_response.get("nextPageToken")
        if not next_page_token:
            break
    videos_info.sort(key=lambda x: x["snippet"]["publishedAt"], reverse=True)
    # Fetch channel information
    channel_info = youtube.channels().list(
        part="snippet",
        id=channel_id
    ).execute()
    # Get the channel name or use a default if not available
    if channel_info.get("items"):
        channel_name = channel_info["items"][0]["snippet"]["title"]
    else:
        channel_name = "Unknown Channel"
    # Modify the channel name for file naming
    channel_name = re.sub(r'[^\w\s]', '', channel_name)
    channel_name = channel_name.replace(" ", "_")
    # Set the output file name
    output_file = f"{channel_name}.txt"
    # Save video information to the output file
    for video_info in videos_info:
        save_video_info_to_file(output_file, video_info)
    print("Information has been saved to the file:", output_file)


# Entry point of the script
if __name__ == "__main__":
    # Prompt user to input the YouTube channel link
    channel_link = input("Enter the YouTube channel link: ")
    # Get the channel ID from the provided link
    channel_id = get_channel_id(channel_link)
    # If a valid channel ID is obtained, get the channel name and save video information
    if channel_id:
        get_channel_name(channel_id)

My goal is to refine the script to achieve more accurate and efficient classification, considering these complexities.
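On the efficiency side, one option worth considering is to list the channel's uploads playlist instead of calling `search().list`. According to the published quota costs, a `search.list` call costs 100 units while `channels.list` and `playlistItems.list` cost 1 unit each. The sketch below assumes `youtube` is an already built client (as returned by `build("youtube", "v3", ...)`); the helper names `iter_upload_video_ids` and `chunked` are hypothetical.

```python
def iter_upload_video_ids(youtube, channel_id):
    """Yield every video ID found in the channel's uploads playlist."""
    channel_response = youtube.channels().list(
        part="contentDetails", id=channel_id
    ).execute()
    items = channel_response.get("items", [])
    if not items:
        return  # unknown channel: nothing to yield
    uploads_playlist_id = items[0]["contentDetails"]["relatedPlaylists"]["uploads"]

    page_token = None
    while True:
        page = youtube.playlistItems().list(
            part="contentDetails",
            playlistId=uploads_playlist_id,
            maxResults=50,
            pageToken=page_token,
        ).execute()
        for entry in page.get("items", []):
            yield entry["contentDetails"]["videoId"]
        page_token = page.get("nextPageToken")
        if not page_token:
            break


def chunked(sequence, size=50):
    """Split a list into slices of at most `size` items (videos.list accepts up to 50 IDs)."""
    for start in range(0, len(sequence), size):
        yield sequence[start:start + size]
```

A possible usage would be `ids = list(iter_upload_video_ids(youtube, channel_id))`, followed by one `videos().list(id=",".join(batch), ...)` call per `batch` in `chunked(ids)`.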
I've been diving into the intricacies of refining this Python script for efficient YouTube channel data extraction and organization. In my attempts so far, I experimented with optimizing the regex patterns for better channel ID extraction and fine-tuning the duration parsing logic.
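As an illustration of the channel ID extraction side, here is a small sketch that reads the ID straight from links of the form `/channel/UC...` without any HTTP request, leaving other link styles (such as handle URLs) to the page-scraping approach in the script above. It assumes channel IDs have the commonly observed shape of `UC` followed by 22 URL-safe characters; the function name `channel_id_from_url` is hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumption: channel IDs look like "UC" followed by 22 characters from [A-Za-z0-9_-].
_CHANNEL_PATH_RE = re.compile(r"/channel/(UC[A-Za-z0-9_-]{22})(?:/|$)")


def channel_id_from_url(channel_link):
    """Return the channel ID embedded in a /channel/<id> link, or None if absent."""
    path = urlparse(channel_link).path
    match = _CHANNEL_PATH_RE.search(path)
    return match.group(1) if match else None
```

When this returns `None`, the script can fall back to `get_channel_id(channel_link)`, which fetches the page and looks for the feed link.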
I expected these adjustments to enhance the script's accuracy in classifying videos, especially in the 'VIDEOS' section. However, the results were not as anticipated. I'm reaching out for your expertise to gain fresh insights and suggestions.
Thank you in advance for your help and patience!