Refining Python Script for Efficient YouTube Channel Data Extraction and Organization

I am immersed in a personal project focused on organizing a YouTube channel, and I am facing specific challenges that need addressing. On YouTube, there are three main types of content: VIDEOS, SHORTS and LIVE.

Context of Content Types on YouTube:

- VIDEOS: The traditional YouTube format with no duration restrictions.
- SHORTS: Short videos, up to 60 seconds, in vertical format.
- LIVE: Live broadcasts that encourage real-time interaction with the audience.

Challenge in the VIDEOS Section:

The VIDEOS section of YouTube is not uniform. Besides regular uploads, it also contains scheduled live stream videos and short clips that do not fit the SHORTS categorization.

- Scheduled live stream videos: Creators schedule broadcasts for specific dates and times. These videos appear in VIDEOS and not in LIVE.
- Uncategorized short clips: Videos with a duration of 60 seconds or less that do not meet specific requirements to be classified as "SHORTS" (such as vertical format, among others).
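To make the distinction concrete, here is a minimal sketch of how the three cases might be told apart from a single `videos().list` item. It assumes the request also asks for the `liveStreamingDetails` part, which the API only returns for live broadcasts (premieres may carry it too, so the label is approximate). The function name `classify_video`, the label strings, and the 60-second threshold are my own assumptions for illustration. As far as I can tell, the Data API v3 has no field that marks a video as a Short, so a short duration can only flag a candidate.

```python
def classify_video(video, duration_seconds, short_limit=60):
    """Roughly classify one item returned by youtube.videos().list().

    `video` is assumed to come from a request with
    part="snippet,contentDetails,liveStreamingDetails".
    `duration_seconds` is the video length, already converted
    from contentDetails.duration to whole seconds.
    The returned labels are hypothetical names, not API values.
    """
    snippet = video.get("snippet", {})

    # liveStreamingDetails is present only for live broadcasts
    # (scheduled, running, or finished); liveBroadcastContent is
    # "live", "upcoming", or "none".
    if ("liveStreamingDetails" in video
            or snippet.get("liveBroadcastContent") in ("live", "upcoming")):
        return "LIVE_OR_SCHEDULED"

    # Duration alone cannot prove a video is a Short (the vertical
    # format is not exposed here), so this is only a candidate label.
    if 0 < duration_seconds <= short_limit:
        return "SHORT_CANDIDATE"

    return "VIDEO"
```

With this split, a scheduled stream that shows up under VIDEOS would be caught by the first branch, while a 45-second horizontal clip would land in `SHORT_CANDIDATE` and still need a manual or separate check.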

It is relevant to mention that I am using the YouTube Data API v3 to extract information efficiently.
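On the efficiency side, one option that may be worth considering is listing the channel's uploads playlist with `playlistItems().list` instead of `search().list`, since the API's quota table charges 100 units per search call and 1 unit per playlist page. The sketch below is only an illustration: the helper names are hypothetical, and deriving the playlist ID by swapping the `UC` prefix for `UU` is an undocumented convention (the documented route is `channels().list(part="contentDetails")` and reading `contentDetails.relatedPlaylists.uploads`).

```python
def uploads_playlist_id(channel_id):
    """Guess the uploads playlist ID from a channel ID.

    Assumption: relies on the undocumented "UC..." -> "UU..." convention.
    """
    if not channel_id.startswith("UC"):
        raise ValueError(f"Unexpected channel ID format: {channel_id!r}")
    return "UU" + channel_id[2:]


def fetch_upload_ids(youtube, playlist_id):
    """Collect every video ID in a playlist, following nextPageToken."""
    video_ids = []
    page_token = None
    while True:
        response = youtube.playlistItems().list(
            part="contentDetails",
            playlistId=playlist_id,
            maxResults=50,
            pageToken=page_token,
        ).execute()
        video_ids.extend(
            item["contentDetails"]["videoId"]
            for item in response.get("items", [])
        )
        page_token = response.get("nextPageToken")
        if not page_token:
            return video_ids


def chunked(items, size=50):
    """Yield slices of at most `size` items (videos().list takes up to 50 IDs)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each chunk can then be passed as `id=",".join(chunk)` to `youtube.videos().list(...)`, which also avoids calling it with an empty ID list.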

I would like to share that I am relatively new to this field and am in the learning process. I appreciate your patience and any guidance you can provide. If you notice any clumsiness in my approach, I would be delighted to receive advice for improvement.

Here is the relevant portion of my Python script, YouTube Channel Scraper:

```python
import os
import re
import requests
from googleapiclient.discovery import build
from datetime import datetime, timedelta

# Function to extract YouTube channel ID from the provided link
def get_channel_id(channel_link):
    try:
        # Make a request to the provided YouTube channel link
        response = requests.get(channel_link)
        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Define a regex pattern to extract the channel ID from the XML feed link
            pattern = r"https://www.youtube.com/feeds/videos.xml\?channel_id=([A-Za-z0-9_-]+)"
            # Search for the pattern in the response text
            match = re.search(pattern, response.text)
            # If a match is found, return the extracted channel ID
            if match:
                return match.group(1)
            else:
                print("No channel found in the provided link.")
        else:
            print("Could not access the link. Make sure the link is valid.")
    except requests.exceptions.RequestException as e:
        print("An error occurred while making the request:", str(e))
    except Exception as e:
        print("An error occurred:", str(e))
    return None

# Function to parse the duration string and extract hours, minutes, and seconds
def parse_duration(duration):
    duration = duration[2:]
    hours, minutes, seconds = 0, 0, 0
    if 'H' in duration:
        hours = int(duration.split('H')[0])
        duration = duration.split('H')[1]
    if 'M' in duration:
        minutes = int(duration.split('M')[0])
        duration = duration.split('M')[1]
    if 'S' in duration:
        seconds = int(duration.split('S')[0])
    return hours, minutes, seconds

# Function to save video information to a file
def save_video_info_to_file(output_file, video_info):
    title = video_info["snippet"]["title"]
    views = video_info["statistics"].get("viewCount", "N/A")
    likes = video_info["statistics"].get("likeCount", "N/A")
    upload_date = video_info["snippet"]["publishedAt"]
    hours, minutes, seconds = parse_duration(video_info["contentDetails"]["duration"])
    # Convert upload date to GMT-5 timezone, this is my time zone
    upload_datetime = datetime.fromisoformat(upload_date[:-1])
    upload_datetime_gmt5 = upload_datetime - timedelta(hours=5)
    # Adjust for videos uploaded before 5 AM GMT-5
    if upload_datetime_gmt5.hour < 5:
        upload_datetime_gmt5 -= timedelta(days=1)
    formatted_upload_date = upload_datetime_gmt5.strftime("%d/%m/%Y")
    formatted_upload_time = upload_datetime_gmt5.strftime("%H:%M:%S")
    duration_str = ""
    if hours > 0:
        duration_str += f"{hours} hour{'s' if hours > 1 else ''}"
    if minutes > 0:
        if duration_str:
            duration_str += ", "
        duration_str += f"{minutes} minute{'s' if minutes > 1 else ''}"
    if seconds > 0:
        if duration_str:
            duration_str += " and "
        duration_str += f"{seconds} second{'s' if seconds > 1 else ''}"
    # Write video information to the output file
    with open(output_file, "a", encoding="utf-8") as file:
        file.write("Title: " + title + "\n")
        file.write("Upload Date: " + formatted_upload_date + "\n")
        file.write("Upload Time: " + formatted_upload_time + "\n")
        file.write("Duration: " + duration_str + "\n")
        file.write("Views: " + str(views) + "\n")
        file.write("Likes: " + str(likes) + "\n\n\n")

# Function to get channel name and save video information to a file
def get_channel_name(channel_id):
    api_key = "[YOUR API HERE]"
    gmt_offset = -5
    # Build the YouTube API service
    youtube = build("youtube", "v3", developerKey=api_key)
    videos_info = []
    # Fetch videos information from the channel
    next_page_token = None
    while True:
        videos_response = youtube.search().list(
            part="id",
            channelId=channel_id,
            maxResults=50,
            pageToken=next_page_token
        ).execute()
        video_ids = [item["id"]["videoId"] for item in videos_response.get("items", []) if "videoId" in item.get("id", {})]
        videos_details_response = youtube.videos().list(
            part="snippet,statistics,contentDetails",
            id=",".join(video_ids)
        ).execute()
        videos_info.extend(videos_details_response["items"])
        next_page_token = videos_response.get("nextPageToken")
        if not next_page_token:
            break
    videos_info.sort(key=lambda x: x["snippet"]["publishedAt"], reverse=True)
    # Fetch channel information
    channel_info = youtube.channels().list(
        part="snippet",
        id=channel_id
    ).execute()
    # Get the channel name or use a default if not available
    if channel_info.get("items"):
        channel_name = channel_info["items"][0]["snippet"]["title"]
    else:
        channel_name = "Unknown Channel"
    # Modify the channel name for file naming
    channel_name = re.sub(r'[^\w\s]', '', channel_name)
    channel_name = channel_name.replace(" ", "_")
    # Set the output file name
    output_file = f"{channel_name}.txt"
    # Save video information to the output file
    for video_info in videos_info:
        save_video_info_to_file(output_file, video_info)
    print("Information has been saved to the file:", output_file)

# Entry point of the script
if __name__ == "__main__":
    # Prompt user to input the YouTube channel link
    channel_link = input("Enter the YouTube channel link: ")
    # Get the channel ID from the provided link
    channel_id = get_channel_id(channel_link)
    # If a valid channel ID is obtained, get the channel name and save video information
    if channel_id:
        get_channel_name(channel_id)
```
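One weak spot I can see in `parse_duration` is that it drops the first two characters and splits on `H`, `M` and `S`, so an ISO 8601 value with a day component such as `P1DT2H` would raise a `ValueError` at `int("DT2")`. A regex-based variant might look like the sketch below. The name `iso8601_duration_to_seconds` is hypothetical, and it returns total seconds rather than an (hours, minutes, seconds) tuple, which is easier to compare against a 60-second threshold.

```python
import re

# Matches the P#DT#H#M#S shape; every component is optional.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso8601_duration_to_seconds(duration):
    """Convert a duration like 'PT1H2M3S' or 'P1DT2H' to whole seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration!r}")
    # Missing components come back as None and are treated as zero.
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(iso8601_duration_to_seconds("PT1H2M3S"))  # 3723
print(iso8601_duration_to_seconds("P1DT2H"))    # 93600
```

The hours, minutes and seconds used for display can still be recovered from the total with `divmod` if the existing output format should be kept.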

My goal is to refine the script to achieve more accurate and efficient classification, considering these complexities.

I've been diving into the intricacies of refining this Python script for efficient YouTube channel data extraction and organization. In my attempts so far, I experimented with optimizing the regex patterns for better channel ID extraction and fine-tuning the duration parsing logic.

I expected these adjustments to enhance the script's accuracy in classifying videos, especially in the 'VIDEOS' section. However, the results were not as anticipated. I'm reaching out for your expertise to gain fresh insights and suggestions.

Thank you in advance for your valuable contribution and patience!
