Real-Time Analysis of Analytics Vidhya Blogathon Leaderboard

Harika Bonthu 23 Mar, 2024 • 8 min read

Introduction

In today’s digital world, we have an ocean of information waiting to be explored online. From tracking the latest trends to understanding what makes a website tick, digging into this data can reveal all sorts of valuable insights. And that’s where web scraping comes in: a nifty technique that lets us gather data from websites automatically. Rather than picking an unfamiliar website, I decided to analyze Analytics Vidhya’s blogathon page, since we are all familiar with it. And because the current leaderboard does not hold much data, I am working with an older leaderboard page that has more data points.

Web scraping involves extracting data from websites and converting unstructured information into structured datasets for analysis and visualization. Python offers several libraries, such as BeautifulSoup and Scrapy, which facilitate this process.
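As a minimal sketch of the idea (example.com is just a placeholder page):

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse it into a navigable tree
response = requests.get('https://example.com', timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)          # the page title
print(len(soup.find_all('a')))  # number of links on the page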

Analysis of Analytics Vidhya

The target webpage (AV Blogathon Leaderboard) contains a leaderboard displaying user names and their corresponding views. The idea is to inspect the HTML structure of the webpage, identify the relevant elements, and extract the desired data using BeautifulSoup’s intuitive syntax.
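For reference, the parsing logic later in this article assumes each leaderboard row looks roughly like the hypothetical structure below, with the user name in the third cell and the view count in the last cell; the live markup may differ:

from bs4 import BeautifulSoup

# Hypothetical row layout the parser assumes; not the verified live markup
row_html = '<tr><td>1</td><td><img/></td><td>Jane Doe</td><td>4</td><td>5321</td></tr>'
cells = BeautifulSoup(row_html, 'html.parser').find_all('td')
name = cells[2].text.strip()         # third cell: user name
views = int(cells[-1].text.strip())  # last cell: view count
print(name, views)                   # Jane Doe 5321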

To achieve this, we will leverage Python’s Tkinter library to build a graphical user interface (GUI), Selenium to scrape the data, and Plotly to visualize the leaderboard results.

Learning Outcomes

  • Learn to extract data from websites and handle dynamic content using Python libraries like BeautifulSoup and Selenium.
  • Understand how to create interactive and visually appealing plots in Python for exploring and presenting data effectively.
  • Explore creating graphical user interfaces in Python with Tkinter to develop interactive applications and provide user feedback.
  • Master techniques to handle errors gracefully and provide informative messages for a smoother user experience in programming projects.

This article was published as a part of the Data Science Blogathon.

Journeying through the Codebase

Let’s get started, and don’t worry if you’re not a coding whiz just yet! We’ll break down the process step by step.

Step 1: Importing Necessary Libraries

import re
import requests
import pandas as pd
import tkinter as tk
from PIL import Image
from bs4 import BeautifulSoup
from selenium import webdriver
import plotly.graph_objects as go
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

In the above code block, we import the libraries and modules required for various tasks within the script. 

Step 2: Building a Data Scraper Using Python’s requests Module

def scrape_leaderboard_requests():
    names, views = [], []  # these stay empty here: the table content is rendered by JavaScript
    # URL of the Analytics Vidhya leaderboard
    url = "https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/#LeaderBoard"
    
    # Headers for the HTTP request
    headers = {
        'authority': 'datahack.analyticsvidhya.com',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-language': 'en-US,en;q=0.9',
        'cache-control': 'max-age=0',
        'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    }
    
    # Sending an HTTP GET request to fetch the webpage
    response = requests.get(url, headers=headers, verify=False, timeout=80)
    
    # Checking if the request was successful (status code 200)
    if response.status_code == 200:
        # Parsing the HTML content of the webpage using BeautifulSoup
        soup = BeautifulSoup(response.content, 'lxml')
        # Finding the leaderboard table on the webpage
        table = soup.find('table', attrs={'class': 'table-responsive'})
        # Checking if the table exists
        if table:
            print(table)  # Print the table content (for debugging)
            # The table would be parsed here to populate names and views
        else:
            print('no such element found')  # Print a message if the table is not found
    else:
        print('invalid status code')  # Print a message if the HTTP request fails
    
    return names, views

The function above uses the Python requests module to fetch the page content but fails because the content is dynamically loaded using JavaScript. In such cases, we can use Selenium. With Selenium, we can automate web interactions such as clicking buttons, filling out forms, and scrolling through web pages, mimicking human behavior in the virtual realm.
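To ground that claim, here is a minimal, self-contained Selenium sketch of such interactions (example.com is a placeholder target, and Selenium 4’s built-in driver manager is assumed):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)  # Selenium 4 can locate chromedriver itself
driver.get('https://example.com')

driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')  # scroll to the bottom
links = driver.find_elements(By.TAG_NAME, 'a')  # collect all anchor elements
if links:
    links[0].click()  # click the first link, mimicking a user action
driver.quit()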

Step 3: Building a Data Scraper Using the Selenium Library

def get_data(driver, url):
    cur_names = []
    cur_views = []
    driver.get(url)
    driver.implicitly_wait(10)  # wait up to 10 seconds for elements to load
    all_elements = driver.find_elements(By.CLASS_NAME, 'table-responsive')
    if all_elements:
        # The leaderboard is the last matching element on the page
        last_ele = all_elements[-1]
        leaderboard_table = last_ele.get_attribute('outerHTML')

        soup = BeautifulSoup(leaderboard_table, 'html.parser')
        rows = soup.find_all('tr')
        for row in rows:
            cells = row.find_all('td')
            if len(cells) >= 3:  # skip header and malformed rows
                cur_names.append(cells[2].text.strip())        # third cell holds the user name
                cur_views.append(int(cells[-1].text.strip()))  # last cell holds the view count

    return cur_names, cur_views


def scrape_leaderboard():
    print('fetching')
    update_message(msg="Fetching leaderboard results, please wait...")
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--log-level=3")

    chrome_driver_path = "path to chromedriver executable file"

    service = Service(chrome_driver_path)
    driver = webdriver.Chrome(service=service, options=chrome_options)

    url = 'https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/#LeaderBoard'

    cur_names, cur_views = get_data(driver, url)
    names.extend(cur_names)
    views.extend(cur_views)

    last_page = None
    # find_elements (plural) returns an empty list instead of raising if no pagination exists
    pagination_eles = driver.find_elements(By.CLASS_NAME, 'page-link')
    if pagination_eles:
        pagination_html = pagination_eles[0].get_attribute('outerHTML')
        match = re.search(r'Page\s+\d+\s+of\s+(\d+)', pagination_html)
        if match:
            last_page = int(match.group(1))

    if last_page:
        for i in range(2, last_page+1):
            url = 'https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/lb/%s/' % i

            cur_names, cur_views = get_data(driver, url)
            names.extend(cur_names)
            views.extend(cur_views)

    driver.quit()

    return names, views

The scrape_leaderboard() function coordinates the scraping process. It initializes a headless Chrome browser using WebDriver, then calls the get_data() function to fetch data from the main leaderboard page and subsequent pages if pagination exists. The script appends the extracted names and views to global lists (names and views), ensuring comprehensive data collection.

The get_data() function is responsible for scraping user names and views from a specified URL. It utilizes Selenium to navigate the webpage and extract data from the leaderboard table using BeautifulSoup.
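One caveat: implicitly_wait() sets a global polling timeout for every element lookup. For dynamically loaded tables, an explicit wait on the specific element is often more reliable. A minimal sketch of that alternative, using Selenium’s WebDriverWait (not part of the original script):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_leaderboard(driver, timeout=10):
    # Block until at least one leaderboard table is present, then return it
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'table-responsive'))
    )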

Illuminating Insights through Data Visualization

Data, in its raw form, can be overwhelming and difficult to comprehend. Data visualization serves as a beacon of light, illuminating patterns, trends, and insights hidden within the data. Plotly, a Python library for interactive data visualization, empowers us to create stunning visualizations that captivate and inform.

From scatter plots to bar charts, Plotly offers a diverse range of visualization options, each tailored to convey specific insights effectively. With its interactive features and customization capabilities, Plotly enables us to engage with data in meaningful ways, unlocking its full potential.
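As a quick taste of the API before we look at plot_data, a minimal bar chart takes only a few lines; the data below is made up purely for illustration:

import plotly.graph_objects as go

# Toy data for illustration only
users = ['Alice', 'Bob', 'Carol']
views = [1200, 950, 780]

fig = go.Figure(go.Bar(x=users, y=views, marker_color=views))
fig.update_layout(title='Views by User (toy data)', xaxis_title='User', yaxis_title='Views')
fig.show()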

The plot_data function transforms the extracted data into interactive scatter plots using Plotly, a versatile visualization library. These plots offer dynamic exploration capabilities, including hover tooltips with user details, customizable color schemes, and axis labels for enhanced clarity.

def plot_data(df, msg=''):
    update_message(msg="Generating report, please wait...")
    fig = go.Figure()

    fig.add_trace(go.Scatter(x=df['Name'], y=df['Views'], mode='markers',
                         marker=dict(color=df['Views'], colorscale='Viridis', size=10),
                         text=[f"User: {name}<br>Views: {view}" for name, view in zip(df['Name'], df['Views'])],
                         hoverinfo='text'))

    bg_image = Image.open("bg.png")  # Replace "bg.png" with your actual image file
    fig.update_layout(images=[dict(source=bg_image, xref="paper", yref="paper", x=0, y=1, sizex=1, sizey=1, opacity=0.1, layer="below")])

    fig.update_layout(
        xaxis=dict(tickangle=45),
        yaxis=dict(range=[0, df['Views'].max() + 10]),
        template='plotly_dark',
        title='Views by User%s'%msg,
        xaxis_title='User',
        yaxis_title='Views'
    )
    fig.show()
    update_message('Report Generated...')

Navigating with a Friendly GUI

The code integrates a user-friendly GUI using Tkinter, a popular Python GUI toolkit. The GUI features interactive buttons that enable users to generate reports, access additional features, and receive real-time progress updates.

root = tk.Tk()
root.geometry("400x400")
root.title("AV Blogathon Report")

button_frame = tk.Frame(root)
button_frame.pack(side="bottom", pady=20)

button_width = 40
execute_button1 = tk.Button(button_frame, text="Get Leaderboard Report", command=get_full_report, width=button_width)
execute_button1.pack(pady=5)

execute_button2 = tk.Button(button_frame, text="Get Top 'N'", command=get_top_ten, width=button_width)
execute_button2.pack(pady=5)

execute_button3 = tk.Button(button_frame, text='Get article Links of user', command=get_article_link, width=button_width)
execute_button3.pack(pady=5)

message_label = tk.Label(button_frame, text="")
message_label.pack(side="bottom", pady=5)

disable_buttons()

root.after(100, check_data)

root.mainloop()

Asynchronous Data Loading

To improve the user experience, the GUI is initialized first and the data load is deferred: root.after(100, check_data) schedules the scrape to run shortly after the window appears, and update_message() calls root.update() so the progress label refreshes during the long-running fetch. Note that check_data still runs on the Tkinter main thread, so the buttons remain disabled until the data arrives. The complete script is shown below.

import re
import requests
import pandas as pd
import tkinter as tk
from PIL import Image
from bs4 import BeautifulSoup
from selenium import webdriver
import plotly.graph_objects as go
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

names = []
views = []


def scrape_leaderboard_requests():
    url = "https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/#LeaderBoard"
    headers = {
        'authority': 'datahack.analyticsvidhya.com',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-language': 'en-US,en;q=0.9',
        'cache-control': 'max-age=0',
        'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    }
    response = requests.get(url, headers=headers, verify=False, timeout=80)  # headers must be passed by keyword
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'lxml')
        table = soup.find('table', attrs={'class': 'table-responsive'})
        if table:
            print(table)
        else:
            print('no such element found')
    else:
        print('invalid status code')
    return names, views


def update_message(msg, level='info'):
    color = 'green'
    if level=='alert':
        color = 'red'
    message_label.config(text=msg, fg=color)
    root.update()
    return
 

def get_data(driver, url):
    cur_names = []
    cur_views = []
    driver.get(url)
    driver.implicitly_wait(10)
    all_elements = driver.find_elements(By.CLASS_NAME, 'table-responsive')
    if all_elements:
        last_ele = all_elements[-1]
        leaderboard_table = last_ele.get_attribute('outerHTML')

        soup = BeautifulSoup(leaderboard_table, 'html.parser')
        rows = soup.find_all('tr')
        for row in rows:
            cells = row.find_all('td')
            if len(cells) >= 3:  # Ensure the row contains the required data
                cur_names.append(cells[2].text.strip())
                cur_views.append(int(cells[-1].text.strip()))

    return cur_names, cur_views


def scrape_leaderboard():
    print('fetching')
    update_message(msg="Fetching leaderboard results, please wait...")
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--log-level=3")

    chrome_driver_path = "path to chromedriver executable file"  # add the correct path here

    service = Service(chrome_driver_path)
    driver = webdriver.Chrome(service=service, options=chrome_options)

    url = 'https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/#LeaderBoard'

    cur_names, cur_views = get_data(driver, url)
    names.extend(cur_names)
    views.extend(cur_views)

    last_page = None
    # find_elements (plural) returns an empty list instead of raising if no pagination exists
    pagination_eles = driver.find_elements(By.CLASS_NAME, 'page-link')
    if pagination_eles:
        pagination_html = pagination_eles[0].get_attribute('outerHTML')
        match = re.search(r'Page\s+\d+\s+of\s+(\d+)', pagination_html)
        if match:
            last_page = int(match.group(1))

    if last_page:
        for i in range(2, last_page+1):
            url = 'https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/lb/%s/' % i

            cur_names, cur_views = get_data(driver, url)
            names.extend(cur_names)
            views.extend(cur_views)

    driver.quit()

    return names, views


def plot_data(df, msg=''):
    update_message(msg="Generating report, please wait...")
    fig = go.Figure()

    fig.add_trace(go.Scatter(x=df['Name'], y=df['Views'], mode='markers',
                         marker=dict(color=df['Views'], colorscale='Viridis', size=10),
                         text=[f"User: {name}<br>Views: {view}" for name, view in zip(df['Name'], df['Views'])],
                         hoverinfo='text'))

    bg_image = Image.open("bg.png")  # Replace "bg.png" with your actual image file
    fig.update_layout(images=[dict(source=bg_image, xref="paper", yref="paper", x=0, y=1, sizex=1, sizey=1, opacity=0.1, layer="below")])

    fig.update_layout(
        xaxis=dict(tickangle=45),
        yaxis=dict(range=[0, df['Views'].max() + 10]),
        template='plotly_dark',
        title='Views by User%s'%msg,
        xaxis_title='User',
        yaxis_title='Views'
    )
    fig.show()
    update_message('Report Generated...')


def get_full_report():
    plot_data(df)


def get_top_ten():
    df_sorted = df.sort_values(by='Views', ascending=False)
    top_10 = df_sorted.head(10)
    plot_data(top_10, msg='(Top 10)')


def get_article_link():
    update_message('Feature not developed yet.', level='alert')


def disable_buttons():
    execute_button1.config(state="disabled")
    execute_button2.config(state="disabled")
    execute_button3.config(state="disabled")

def enable_buttons():
    execute_button1.config(state="normal")
    execute_button2.config(state="normal")
    execute_button3.config(state="normal")


def check_data():
    names, views = scrape_leaderboard()
    if not names or not views:
        update_message(msg="No results found. Please try again after some time...", level='alert')
        root.destroy()
        return  # destroying the root window ends the mainloop
    else:
        enable_buttons()
        update_message(msg='Data Fetched, please proceed to generate report..')
        global df
        df = pd.DataFrame({'Name': names, 'Views': views})


root = tk.Tk()
root.geometry("400x400")
root.title("AV Blogathon Report")

button_frame = tk.Frame(root)
button_frame.pack(side="bottom", pady=20)

button_width = 40
execute_button1 = tk.Button(button_frame, text="Get Leaderboard Report", command=get_full_report, width=button_width)
execute_button1.pack(pady=5)

execute_button2 = tk.Button(button_frame, text="Get Top 'N'", command=get_top_ten, width=button_width)
execute_button2.pack(pady=5)

execute_button3 = tk.Button(button_frame, text='Get article Links of user', command=get_article_link, width=button_width)
execute_button3.pack(pady=5)

message_label = tk.Label(button_frame, text="")
message_label.pack(side="bottom", pady=5)

disable_buttons()

root.after(100, check_data)

root.mainloop()
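If a fully non-blocking fetch is desired, the scrape could be moved onto a worker thread while the Tkinter main thread polls for completion. Below is a minimal sketch of that pattern, reusing the names from the script above; it is an illustrative variation, not part of the original code, and scrape_leaderboard’s own update_message call would also need to be routed back to the main thread, since Tkinter widgets are not thread-safe:

import threading

result = {}

def fetch_in_background():
    # Runs on a worker thread so the Tkinter mainloop stays responsive
    result["names"], result["views"] = scrape_leaderboard()

def poll_for_data():
    # Runs on the main thread; Tkinter widgets must only be touched here
    global df
    if "names" in result:
        df = pd.DataFrame({'Name': result["names"], 'Views': result["views"]})
        enable_buttons()
        update_message(msg='Data Fetched, please proceed to generate report..')
    else:
        root.after(200, poll_for_data)  # check again in 200 ms

# Replaces root.after(100, check_data) in the script above
threading.Thread(target=fetch_in_background, daemon=True).start()
root.after(200, poll_for_data)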

Enhancing User Experience

A smooth user experience is paramount for engagement and usability. The codebase incorporates several strategies to enhance the user experience and provide real-time insights:

  • Interactive Visualizations: Plotly’s interactive plots empower users to explore data dynamically, facilitating deeper insights into user engagement trends, outlier detection, and pattern recognition.
  • Error Handling and Feedback: Robust error handling mechanisms ensure that users are notified of data retrieval failures or unexpected errors. Informative messages and progress updates throughout the data retrieval and visualization process maintain transparency and user engagement.
  • Customization Options: Users have the flexibility to customize plot attributes such as color schemes, marker sizes, and axis labels to suit their preferences and analytical needs, as in the sketch after this list.
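For example, plot_data could be parameterized so the colorscale and marker size are chosen by the caller; a hypothetical variant (the parameter names are illustrative, and go refers to the plotly.graph_objects import from the script above):

def plot_data_custom(df, colorscale='Plasma', marker_size=12, msg=''):
    # Hypothetical variant of plot_data with caller-chosen styling
    fig = go.Figure(go.Scatter(
        x=df['Name'], y=df['Views'], mode='markers',
        marker=dict(color=df['Views'], colorscale=colorscale, size=marker_size),
    ))
    fig.update_layout(title='Views by User%s' % msg, xaxis_title='User', yaxis_title='Views')
    fig.show()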

Future Directions and Advanced Applications

While the provided code offers a solid foundation for real-time blogathon analytics, the journey doesn’t end here. We can explore several enhancements and advanced applications to elevate the analytics capabilities:

  • Integration with External APIs: Seamless integration with APIs from Analytics Vidhya or other platforms can streamline data retrieval processes, provide access to datasets, and unlock advanced analytics functionalities.
  • Advanced Visualization Techniques: Exploring advanced visualization techniques such as heatmaps, network graphs, and animated plots can offer deeper insights into user interactions, collaboration patterns, and content consumption behaviors.
  • Making the dashboard more interactive: As noted earlier, the get_article_link method is not yet implemented; it could be developed into a function that returns the article links for a given username, as sketched below.
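A hedged sketch of what such a function might look like, reusing the driver setup and the By import from the script above; the assumption that each leaderboard row contains anchor tags linking to the user’s articles is unverified, so the selectors here are purely illustrative:

def get_article_links(driver, username):
    # Hypothetical implementation: the row/anchor structure is an assumption
    links = []
    driver.get('https://datahack.analyticsvidhya.com/contest/data-science-blogathon-23/#LeaderBoard')
    driver.implicitly_wait(10)
    for row in driver.find_elements(By.TAG_NAME, 'tr'):
        if username in row.text:
            for anchor in row.find_elements(By.TAG_NAME, 'a'):
                href = anchor.get_attribute('href')
                if href:
                    links.append(href)
    return links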

Key Takeaways

  • The code demonstrates automated web scraping using Selenium and BeautifulSoup libraries to extract data from Analytics Vidhya’s leaderboard page.
  • The code utilizes Plotly, a powerful graphing library, to create interactive scatter plots visualizing user views.
  • The code integrates a user-friendly graphical user interface (GUI) using Tkinter, allowing users to interact with the scraping and visualization functionalities effortlessly.
  • Users can generate different types of reports based on their preferences: a full leaderboard report, the top ‘N’ users with the highest views, or the article links of specific users (although this last feature is under development).
  • The code includes mechanisms for error handling and user feedback. If scraping does not find results or encounters an error, the system displays appropriate messages to guide users on the next steps.

Conclusion

This article provides a comprehensive exploration of web scraping, data visualization, and GUI development in Python. By dissecting the codebase, learners gain insights into automated data extraction using BeautifulSoup and Selenium, interactive visualization with Plotly, and building user-friendly interfaces with Tkinter. The article focuses on the analysis of the Analytics Vidhya Blogathon leaderboard, offering a practical application of these concepts. Learners can then embark on their own data-driven projects, extracting insights, creating engaging visualizations, and designing user interfaces.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

