I have a Python report to complete and need explanations along with the answers to help me learn.

Please refer to the attached paper for the detailed requirements.
Kindly answer Questions 1, 2 and 3, including all of their sub-questions. I have also attached the study guide for your reference.
Requirements: 5 pages
ICT233 Copyright © 2023 Singapore University of Social Sciences (SUSS)
ECA – July Semester 2023

ICT233 End-of-Course Assessment – July Semester 2023
Data Programming

INSTRUCTIONS TO STUDENTS:
1. This End-of-Course Assessment paper comprises 8 pages (including the cover page).
2. You are to include the following particulars in your submission: Course Code, Title of the ECA, SUSS PI No., Your Name, and Submission Date.
3. Late submission will be subjected to the marks deduction scheme. Please refer to the Student Handbook for details.

IMPORTANT NOTE
ECA Submission Deadline: Sunday, 05 November 2023 12:00 pm
ECA Submission Guidelines

Please follow the submission instructions stated below:

This ECA carries 70% of the course marks and is a compulsory component. It is to be done individually and not collaboratively with other students.

Submission
You are to submit the ECA assignment in exactly the same manner as your tutor-marked assignments (TMA), i.e. using Canvas. Submission in any other manner like hardcopy or any other means will not be accepted.

Electronic transmission is not immediate. It is possible that the network traffic may be particularly heavy on the cut-off date and connections to the system cannot be guaranteed. Hence, you are advised to submit your assignment the day before the cut-off date in order to make sure that the submission is accepted and in good time.

Once you have submitted your ECA assignment, the status is displayed on the computer screen. You will only receive a successful assignment submission message if you had applied for the e-mail notification option.

ECA Marks Deduction Scheme
Please note the following:
(a) Submission Cut-off Time – Unless otherwise advised, the cut-off time for ECA submission will be at 12:00 noon on the day of the deadline. All submission timings will be based on the time recorded by Canvas.
(b) Start Time for Deduction – Students are given a grace period of 12 hours. Hence calculation of late submissions of ECAs will begin at 00:00 hrs the following day (this applies even if it is a holiday or weekend) after the deadline.
(c) How the Scheme Works – From 00:00 hrs the following day after the deadline, 10 marks will be deducted for each 24-hour block. Submissions that are subject to more than 50 marks deduction will be assigned zero mark. For examples on how the scheme works, please refer to Section 5.2 Para 1.7.3 of the Student Handbook.
Any extra files, missing appendices or corrections received after the cut-off date will also not be considered in the grading of your ECA assignment.

Plagiarism and Collusion
Plagiarism and collusion are forms of cheating and are not acceptable in any form of a student’s work, including this ECA assignment. You can avoid plagiarism by giving appropriate references when you use some other people’s ideas, words or pictures (including diagrams). Refer to the American Psychological Association (APA) Manual if you need reminding about quoting and referencing. You can avoid collusion by ensuring that your submission is based on your own individual effort.

The electronic submission of your ECA assignment will be screened through a plagiarism detecting software. For more information about plagiarism and cheating, you should refer to the Student Handbook. SUSS takes a tough stance against plagiarism and collusion. Serious cases will normally result in the student being referred to SUSS’s Student Disciplinary Group.
(Full marks: 100)

Question 1

Question 1a
(i) To analyse the webpage and perform the following:
• Use the BeautifulSoup and requests libraries for web scraping and Pandas for data manipulation.
• Scrape data from a Wikipedia page detailing the list of Korean dramas:
• Extract information for each drama series including its title, url, start_year, and end_year into a Pandas data frame.
• start_year refers to the year in which the first episode of a particular drama was aired.
• end_year refers to the year in which the last episode of that drama was aired.
• Assign the value np.nan to start_year and end_year when these fields lack a specific value.
(ii) Introduce an additional column named `status` to the data frame, which is updated with one of the terms ‘cancelled’, ‘completed’, ‘ongoing’, or ‘tba’.
(iii) Certain dramas encompass sub-dramas, such as “Drama City (1984–2008)”. To account for this hierarchical relationship, incorporate an additional column named `parent`. This column will store the URL of the parent drama. For instance, the parent value for “Drama City: What Should I Do? (2004)” is the URL corresponding to “Drama City (1984–2008)”.
(iv) To perform the following tasks:
• Modify the data type for the start and end years to integer.
• Provide a statistical description of the obtained data.
(10 marks)

Question 1b
To analyze the dataset and perform the following tasks:
• Calculate the frequency of each unique status value.
• Use Pandas’s methods to identify all dramas (title, url, status, start_year, end_year) that have at least one associated sub-drama.
• Use Pandas’s methods to identify all dramas (title, url, status, start_year, end_year) that have at least 3 associated sub-dramas.
(10 marks)

Question 1c
Plot the number of dramas starting each year to identify trends over time.
(2 marks)
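Two steps in Question 1 tend to trip people up: converting year columns that contain np.nan to integers, and finding parents by counting how many rows point at them. The sketch below illustrates both on made-up rows. The titles, URLs and the helper name `parents_with_at_least` are placeholders, not scraped data or part of the paper, and this is only one possible approach.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the scraped table; titles and URLs are placeholders.
dramas = pd.DataFrame(
    {
        "title": ["Parent A", "Sub A1", "Sub A2", "Sub A3", "Parent B", "Sub B1", "Solo C"],
        "url": ["/wiki/A", "/wiki/A1", "/wiki/A2", "/wiki/A3", "/wiki/B", "/wiki/B1", "/wiki/C"],
        "start_year": [1984, 2004, 2005, 2006, 2010, 2011, np.nan],
        "end_year": [2008, 2004, 2005, 2006, np.nan, 2011, np.nan],
        "parent": [None, "/wiki/A", "/wiki/A", "/wiki/A", None, "/wiki/B", None],
    }
)

# np.nan forces a float column; the nullable "Int64" dtype stores integers
# while keeping missing years as <NA>.
dramas = dramas.astype({"start_year": "Int64", "end_year": "Int64"})

# Count how many rows point at each parent URL (missing parents are ignored).
sub_counts = dramas["parent"].value_counts()


def parents_with_at_least(n):
    """Return the rows whose URL is the parent of at least n sub-dramas."""
    parent_urls = sub_counts[sub_counts >= n].index
    return dramas[dramas["url"].isin(parent_urls)]


print(parents_with_at_least(1)[["title", "url"]])
print(parents_with_at_least(3)[["title", "url"]])
```

The same `value_counts()` call applied to the `status` column would give the frequency of each status value.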
Question 1d
To analyze and perform the following tasks:
• Calculate the broadcasting duration of each drama.
• If a drama is currently ongoing, consider the current year as the end of its duration.
• Determine the dramas with the longest and shortest durations.
• Calculate the average broadcasting duration of all dramas.
• Detect duplicated dramas and eliminate the duplicates.
(5 marks)

Question 2

Question 2a
To analyze and implement the following tasks:
• Use BeautifulSoup to create a function named scrape_drama_details that accepts a Wikipedia drama URL and retrieves details such as genres, main actors/actresses, and the total number of episodes.
• Exclude any genres or actors that lack a dedicated Wikipedia page.
• Test the function scrape_drama_details with the following drama URLs and display the results:
1. '/wiki/12_Years_Promise'
2. '/wiki/100_Days_My_Prince'
3. '/wiki/A_Good_Day_to_Be_a_Dog'
• Invoke the scrape_drama_details function to gather information for all drama URLs (obtained from step 1a). The data should be organized as follows:
[
  {
    'url': '/wiki/12_Signs_of_Love',
    'genres': [('Romantic comedy', '/wiki/Romantic_comedy')],
    'starrings': [('Yoon Jin-seo', '/wiki/Yoon_Jin-seo'), ('On Joo-wan', '/wiki/On_Joo-wan')],
    'episodes': 16,
  },
  {
    'url': '/wiki/12_Years_Promise',
    'genres': [('Romantic comedy', '/wiki/Romantic_comedy')],
    'starrings': [('Lee So-yeon', '/wiki/Lee_So-yeon_(actress)'), ('Namkoong Min', '/wiki/Namkoong_Min'), ('Lee Tae-im', '/wiki/Lee_Tae-im'), ('Yoon So-hee', '/wiki/Yoon_So-hee'), ('Lee Won-keun', '/wiki/Lee_Won-keun'), ('Ryu Hyo-young', '/wiki/Ryu_Hyo-young')],
    'episodes': 26,
  },
  …
]
• Present the details of the first five dramas.
(10 marks)
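For the duration steps in Question 1d, a minimal pandas sketch might look like the following. The rows are invented, `add_duration` is a hypothetical helper name, and two choices are assumptions rather than anything the paper specifies: duplicates are detected by `url`, and duration is taken as end year minus start year (so a drama that starts and ends in the same year has duration 0).

```python
from datetime import date

import pandas as pd

# Toy rows; titles and years are placeholders, not scraped values.
dramas = pd.DataFrame(
    {
        "title": ["Long Runner", "One Season", "Still Airing", "One Season"],
        "url": ["/wiki/Long", "/wiki/One", "/wiki/Still", "/wiki/One"],
        "status": ["completed", "completed", "ongoing", "completed"],
        "start_year": [1984, 2004, 2022, 2004],
        "end_year": [2008, 2004, None, 2004],
    }
)


def add_duration(df, current_year=None):
    """Return a de-duplicated copy of df with a duration column in years."""
    if current_year is None:
        current_year = date.today().year
    # Assumption: two rows with the same url are the same drama.
    out = df.drop_duplicates(subset="url").copy()
    # Ongoing dramas borrow the current year as their end year.
    end = out["end_year"].mask(out["status"] == "ongoing", current_year)
    # Assumption: duration = end year - start year.
    out["duration"] = end - out["start_year"]
    return out


# The year is pinned here only so the printed numbers are reproducible.
result = add_duration(dramas, current_year=2023)
longest = result[result["duration"] == result["duration"].max()]
shortest = result[result["duration"] == result["duration"].min()]
print(longest[["title", "duration"]])
print(shortest[["title", "duration"]])
print("average duration:", result["duration"].mean())
```

Filtering on the maximum and minimum, rather than taking a single row, keeps every drama that ties for longest or shortest.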
Question 2b
To formulate and perform the following tasks:
• Construct a new Pandas data frame from the Q2a) scraped drama details, including drama_url, genre, and genre_url. For each genre linked to a drama, there is a separate row in the data frame for that drama.
• Perform a cleaning process on the new data frame, which includes the following steps:
• Discard meaningless genres.
• Map genres into these common categories: ‘comedy’, ‘historical’, ‘romance’, ‘fantasy’, ‘slice of life’, ‘crime’, ‘drama’, ‘speculative’, ‘action’, ‘thriller’, ‘horror’, ‘science’, ‘political’, ‘musical’ and ‘sport’.
• For genres that do not fit into any of these categories or are not commonly found, map them under an ‘uncommon’ genre.
• After the processing, visualize the count of dramas corresponding to each mapped genre.
(5 marks)

Question 2c
To perform the following tasks:
• Identify all distinct (genre_url, genre) pairs from the data frame Q2b).
• Some genre URLs lead to other URLs through redirection. Use the requests and BeautifulSoup libraries to capture these redirected URLs into a new column redirected_genre_url. Show that the redirected URL of ‘’ is ‘’.
• Ensure that the redirected genre URLs adhere to the pattern {genre_name}#{anchor}, where the anchor is not mandatory. Identify any redirected URLs that deviate from this structure and modify them to align with the desired format.
• Which redirected genre URLs include an anchor, denoted by #{anchor}?
• Note that certain genre URLs lead to the same redirected destinations. As a result, a single redirected URL might be linked to various genres. Use your discretion to modify their genres, ensuring that every redirected URL aligns with only one genre.
• Eliminate any duplicate combinations of genre and its corresponding redirected URL.
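The pattern check and anchor question in Question 2c can be approached with the standard library alone. The sketch below assumes the target shape is `/wiki/{genre_name}` with an optional `#{anchor}` (the URL prefix is missing from the paper text above, so that prefix is a guess), and the `en.wikipedia.org` host in the usage lines is likewise an assumption. `normalise_genre_url` and `split_anchor` are hypothetical helper names.

```python
import re
from urllib.parse import urldefrag, urlsplit

# Assumed target shape: /wiki/{genre_name} with an optional #{anchor}.
GENRE_URL = re.compile(r"^/wiki/[^#?]+(#[^#]+)?$")


def normalise_genre_url(url):
    """Reduce a URL to /wiki/{genre_name}#{anchor} form, or None if it cannot fit."""
    parts = urlsplit(url)
    # Drop scheme, host and query string; keep only the path and the anchor.
    candidate = parts.path + (f"#{parts.fragment}" if parts.fragment else "")
    return candidate if GENRE_URL.match(candidate) else None


def split_anchor(url):
    """Return (page, anchor); anchor is an empty string when the URL has none."""
    page, anchor = urldefrag(url)
    return page, anchor


print(normalise_genre_url("https://en.wikipedia.org/wiki/Romantic_comedy"))
print(normalise_genre_url("/w/index.php?title=Romantic_comedy"))
print(split_anchor("/wiki/Drama_(film_and_television)#Misidentified_categories"))
```

Rows where `normalise_genre_url` returns None are the ones that would need manual attention, and rows where `split_anchor` returns a non-empty anchor answer the "which URLs include an anchor" bullet.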
(8 marks)

Question 2d
To formulate and perform the following tasks:
• Define a function called fetch_genre_description which utilizes BeautifulSoup to extract textual content from a redirected genre URL.
• For a redirected URL that has an anchor, only retrieve the textual content associated with that specific section on the webpage.
• Call the fetch_genre_description function with and ’s_films , and display the results.
• For genres that are not valid, return None, such as the URL ‘/wiki/Drama_(film_and_television)#Misidentified_categories’.
• Remove the newline character ‘\n’ from the scraped genre description.
• Call the fetch_genre_description function for each row in the data frame and introduce a new column named genre_description filled with the extracted genre descriptions.
• Delete entries that have a blank description.
(10 marks)

Question 2e
To analyze and implement the following tasks:
• Utilize the CountVectorizer.fit_transform function from the sklearn library to calculate the word frequency features for each genre, using the processed description data from step Q2(d).
• Only take into account English words and exclude common English stop words.
• Utilize the LatentDirichletAllocation algorithm from the sklearn library to group word frequency features into 16 clusters, employing a random seed value of 42.
• Add the clusters computed by the LatentDirichletAllocation.transform method to the data frame from step Q2(c) under a new column named topic.
(6 marks)

Question 2f
The word vocabulary can be found in vectorizer.get_feature_names_out(). The 16 identified topics can be retrieved from lda.components_. Every topic is characterized by a vector whose length matches the size of the vocabulary. Words corresponding to higher values in a topic vector play a more significant role in defining that topic.
• Extract the top 10 words for each topic from the LDA model’s components and then display them in a data frame format.
• Extract a single observation, such as insights regarding data cleansing or preprocessing, or the categorization of genres based on word distributions.
(5 marks)
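The "top 10 words per topic" step in Question 2f comes down to sorting each topic vector and looking up the matching vocabulary entries. The sketch below shows that idea with NumPy on a tiny made-up matrix. `toy_components` and `toy_vocabulary` are invented stand-ins for lda.components_ and vectorizer.get_feature_names_out(), and `top_words_per_topic` is a hypothetical helper name.

```python
import numpy as np


def top_words_per_topic(components, vocabulary, n_words=10):
    """Return, for each topic row, the n_words highest-weighted words."""
    components = np.asarray(components)
    vocabulary = np.asarray(vocabulary)
    # argsort is ascending, so reverse each row and keep the first n_words columns.
    top_idx = np.argsort(components, axis=1)[:, ::-1][:, :n_words]
    return [vocabulary[row].tolist() for row in top_idx]


# Made-up stand-ins for lda.components_ and vectorizer.get_feature_names_out().
toy_vocabulary = ["crime", "love", "magic", "police", "wedding"]
toy_components = [
    [0.1, 2.0, 0.3, 0.2, 1.5],  # weights lean towards romance words
    [3.0, 0.1, 0.2, 2.5, 0.1],  # weights lean towards crime words
]

top_words = top_words_per_topic(toy_components, toy_vocabulary, n_words=2)
for topic_id, words in enumerate(top_words):
    print(topic_id, words)
```

Passing the resulting list of lists to `pandas.DataFrame` would give one row per topic and one column per rank, which matches the data frame format the question asks for.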
Question 2g
Create a visual representation illustrating the correlations between the calculated topics and genres. Extract three observations from the visualization.
(4 marks)

Question 3

Question 3a
Read the dramas.csv file and store its contents in a dataframe called dramas_df and perform the following tasks:
• Using the json library, read the contents of drama_details.txt and assign it to the drama_details variable.
• Integrate the episode count information from the drama_details variable into the dramas_df dataframe.
• Based on the data in the drama_details variable, construct two new dataframes:
1. starrings_df that contains columns (starring_name, starring_url), where starring_url represents the Wikipedia link of the actor or actress.
2. castings_df that has columns (starring_url, drama_url), indicating which actors or actresses starred in which dramas.
• Make sure to remove any redundant entries in both starrings_df (based on starring_url) and castings_df (considering all columns).
(5 marks)

Question 3b
Design and save the dramas_df, starrings_df, and castings_df in a SQLite database.
(2 marks)

Question 3c
Which actor or actress has starred in the most number of dramas? (Only one actor / actress is needed.) Address this inquiry by:
1. The sqlite3 library.
2. The SQLAlchemy library.
(6 marks)
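For the sqlite3 half of Question 3c, the query is essentially a join, a GROUP BY and an ORDER BY with LIMIT 1. Here is a minimal sketch against an in-memory database. The rows are placeholders rather than data from drama_details.txt, and the table layout (primary keys included) is one possible design, not the required one.

```python
import sqlite3

# Toy rows; the names and URLs are placeholders, not data from drama_details.txt.
starrings = [("Actor A", "/wiki/Actor_A"), ("Actor B", "/wiki/Actor_B")]
castings = [
    ("/wiki/Actor_A", "/wiki/Drama_1"),
    ("/wiki/Actor_A", "/wiki/Drama_2"),
    ("/wiki/Actor_B", "/wiki/Drama_1"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE starrings (starring_name TEXT, starring_url TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE castings (starring_url TEXT, drama_url TEXT, "
    "PRIMARY KEY (starring_url, drama_url))"
)
conn.executemany("INSERT INTO starrings VALUES (?, ?)", starrings)
conn.executemany("INSERT INTO castings VALUES (?, ?)", castings)

# Count distinct dramas per actor and keep the single highest count.
top_actor = conn.execute(
    """
    SELECT s.starring_name, COUNT(DISTINCT c.drama_url) AS n_dramas
    FROM castings AS c
    JOIN starrings AS s ON s.starring_url = c.starring_url
    GROUP BY c.starring_url
    ORDER BY n_dramas DESC, s.starring_name
    LIMIT 1
    """
).fetchone()
print(top_actor)
conn.close()
```

The secondary sort on `starring_name` only makes the result deterministic when two actors tie. With real dataframes, `DataFrame.to_sql` can write the tables through the same connection instead of the manual inserts shown here.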
Question 3d
What is the number of actors or actresses who took part in several dramas that aired simultaneously, having overlapping broadcast periods (start year to end year)? Address this inquiry by applying:
1. The sqlite3 library.
2. The SQLAlchemy library.
(6 marks)

Question 3e
Which pairs of actors or actresses have appeared together in the most dramas? (Provide ALL pairs that share the top spot for co-starring in number of dramas.)
(3 marks)

Question 3f
Which actors/actresses have participated in the highest number of episodes?
(3 marks)

—– END OF ECA PAPER —–
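Question 3e asks for every pair tied at the top, so the counting step has to keep all pairs with the maximum count instead of stopping at the first one. One way to express that in plain Python is sketched below. The castings rows are made up and `top_costar_pairs` is a hypothetical helper name. A SQL self-join on castings would be an equally valid route.

```python
from collections import Counter
from itertools import combinations


def top_costar_pairs(castings):
    """Return (pairs, count) covering every pair tied for the most shared dramas."""
    # Group actors by the drama they appeared in.
    cast_by_drama = {}
    for starring_url, drama_url in castings:
        cast_by_drama.setdefault(drama_url, set()).add(starring_url)

    pair_counts = Counter()
    for cast in cast_by_drama.values():
        # Sorting gives each pair one canonical order, so (A, B) and (B, A) merge.
        pair_counts.update(combinations(sorted(cast), 2))

    if not pair_counts:
        return [], 0
    best = max(pair_counts.values())
    return sorted(pair for pair, n in pair_counts.items() if n == best), best


# Toy (starring_url, drama_url) rows; all values are placeholders.
toy_castings = [
    ("/wiki/A", "/wiki/D1"), ("/wiki/B", "/wiki/D1"), ("/wiki/C", "/wiki/D1"),
    ("/wiki/A", "/wiki/D2"), ("/wiki/B", "/wiki/D2"),
    ("/wiki/C", "/wiki/D3"), ("/wiki/A", "/wiki/D3"),
]
pairs, shared = top_costar_pairs(toy_castings)
print(pairs, shared)
```

For the overlap test in Question 3d, two broadcast periods overlap when each one starts no later than the other ends, which can be written as a condition on start_year and end_year in a self-join.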