KDnuggets Top Blog Winner

SQL and Python Interview Questions for Data Analysts

Walking you through the most important SQL and Python technical concepts and four interview questions to practice for the Data Analyst position.



SQL and Python Interview Questions for Data Analysts
Image by Author

 

Data analyst – probably the most straightforward job title of all data professionals. There’s not much thinking involved in deciphering what data analysts do. Call me Captain Obvious: data analysts analyze data.

Sure, they need diverse skills such as data visualization, analytical mindset, communication, and business intelligence.

But to make the most out of those skills, they have to handle and analyze data somehow. Working with large data sets requires knowledge of a programming language or two. The two probably the most popular languages used in data analysis are SQL and Python.

You’ll use them daily in most data analysis jobs. No wonder job interviews for these positions mostly revolve around testing your SQL and Python skills.

Let me show you several typical interview examples that test different technical concepts useful to any data analyst.

 

Data Analyst SQL Interview Questions

 

Data Analyst SQL Interview Questions
Image by Author

 

Question #1: Days At Number One (PostgreSQL)

 

“Find the number of days a US track has stayed in the 1st position for both the US and worldwide rankings. Output the track name and the number of days in the 1st position. Order your output alphabetically by track name.

If the region 'US' appears in dataset, it should be included in the worldwide ranking.”

Here’s the link to the question if you want to follow along with me.

 

Technical Concepts

 

This problem requires you to know the following SQL concepts:

  • Data Aggregation
  • Subqueries
  • CASE Statement
  • Window Functions
  • JOINs
  • Filtering Data
  • Grouping Data
  • Sorting Data

These are also concepts that you’ll use most often as a data analyst.

 

Solution & Output

 

Get familiar with the code below, and then I’ll explain it. This code is written in PostgreSQL.

SELECT 
  trackname, 
  MAX(n_days_on_n1_position) AS n_days_on_n1_position 
FROM 
  (
    SELECT 
      us.trackname, 
      SUM(
        CASE WHEN world.position = 1 THEN 1 ELSE 0 END
      ) OVER(PARTITION BY us.trackname) AS n_days_on_n1_position 
    FROM 
      spotify_daily_rankings_2017_us us 
      INNER JOIN spotify_worldwide_daily_song_ranking world ON world.trackname = us.trackname 
      AND world.date = us.date 
    WHERE 
      us.position = 1
  ) tmp 
GROUP BY 
  trackname 
ORDER BY 
  trackname;

 

I’ll start the explanation from the subquery. Its purpose is to find tracks that were ranked first in the US and worldwide rankings.

This subquery sets the condition using the CASE statement, which searches for the tracks in the first position worldwide. This statement is part of the SUM() window function. It returns the number of days each track that satisfies the condition spent as number one.

To get this, you need to use data from both available tables, and you need JOINs. In this case, it’s INNER JOIN because you’re interested only in tracks from both tables. Join the tables on the track name and date.

The question asks you to output only tracks that were ranked first. You need to filter data using the WHERE clause to get that.

The subquery is then used in the main SELECT statement. The main query references the subquery and uses the MAX() aggregate function and GROUP BY to return the longest streak on the first position by track.

Finally, the result is sorted alphabetically by the track name.

trackname n_days_on_n1_position
Bad and Boujee (feat. Lil Uzi Vert) 1
HUMBLE. 3

 

If you need more clarification on how to approach this data analyst interview question, my team and I prepared a walkthrough video that might help.



 

Question #2: Trips and Users (MySQL)

 

“The cancellation rate is computed by dividing the number of canceled (by client or driver) requests with unbanned users by the total number of requests with unbanned users on that day.

Write a SQL query to find the cancellation rate of requests with unbanned users (both client and driver must not be banned) each day between "2013-10-01" and "2013-10-03". Round Cancellation Rate to two decimal points.

Return the result table in any order.

The query result format is in the following example.”

Here’s the link to this data analyst interview question if you want to follow along with me.

 

Technical Concepts

 

To solve this question, you’ll need most of the concepts you used in the previous. However, there are also some additional ones:

  • CTE
  • Rounding the Numbers
  • Casting Data Types

 

Solution & Output

 

The solution is written in MySQL.

WITH stats AS
  (SELECT request_at,
          t.status <> 'completed' AS canceled
   FROM trips t
   JOIN users c ON (client_id = c.users_id
                    AND c.banned = 'no')
   JOIN users d ON (driver_id = d.users_id
                    AND d.Banned = 'no')
   WHERE request_at BETWEEN '2013-10-01' AND '2013-10-03' )

SELECT request_at AS Day,
       ROUND(CAST(SUM(canceled) AS FLOAT)/CAST(COUNT(*) AS FLOAT), 2) AS 'Cancellation Rate'
FROM stats
GROUP BY Day
ORDER BY Day;

 

Let’s first focus on the CTE; this one’s named stats. It’s a SELECT statement that returns the date of the request and its status, where the status is not ‘completed’. In other words, the request is canceled.

The request can be canceled both by the client or driver. So this query needs JOIN twice. The first time, the trips are joined with the users to get the requests canceled by the client who wasn’t banned. The other JOIN uses the same table to get the requests canceled by the drivers.

This data analyst interview question asks to include only certain dates, and this criterion is stated in the WHERE clause.

Now comes another SELECT statement that references the CTE. It divides the number of canceled requests by the total number of requests. This is done using two aggregate functions: SUM() and COUNT(). Also, the ratio has to be changed to a decimal number and rounded to two decimal places.

Finally, the output is grouped and ordered by day

Day  Cancellation Rate
2013-10-01 0.33
2013-10-02 0
2013-10-03 0.5

 

Data Analyst Python Interview Questions

 

Data Analyst Python Interview Questions
Image by Author

 

Question #3: Product Families

 

“The CMO is interested in understanding how the sales of different product families are affected by promotional campaigns. To do so, for each product family, show the total number of units sold, as well as the percentage of units sold that had a valid promotion among total units sold. If there are NULLS in the result, replace them with zeroes. Promotion is valid if it's not empty and it's contained inside promotions table.”

Here’s the link to the question if you want to follow along with me.

 

Technical Concepts

 

Doing data analysis in Python is one of the much appreciated, often mandatory, skills for data analysts. While Python offers a lot of possibilities for data analysis, this usually isn’t enough. You’ll also have to use different data analysis libraries, such as Pandas and NumPy.

In solving this data analyst interview question, you’ll need to be fluent in using the following concepts:

  • merge()
  • lambda functions
  • isna()
  • unique()
  • groupby()
  • data aggregation
  • Working with DataFrames

 

Solution & Output

 

Here’s how to solve this problem in Python.

import pandas as pd

merged = facebook_sales.merge(
    right=facebook_products, how="outer", on="product_id"
)
merged["valid_promotion"] = merged.promotion_id.map(
    lambda x: not pd.isna(x)
    and x in facebook_sales_promotions.promotion_id.unique()
)

valid_promotion = merged[merged.valid_promotion]

invalid_promotion = merged[~merged.valid_promotion]

result_valid = (
    valid_promotion.groupby("product_family")["units_sold"]
    .sum()
    .to_frame("valid_solds")
    .reset_index()
)

result_invalid = (
    invalid_promotion.groupby("product_family")["units_sold"]
    .sum()
    .to_frame("invalid_solds")
    .reset_index()
)

result = result_valid.merge(
    result_invalid, how="outer", on="product_family"
).fillna(0)

result["total"] = result["valid_solds"] + result["invalid_solds"]
result["valid_solds_percentage"] = (
    result["valid_solds"] / result["total"] * 100
)

result = result[
    ["product_family", "total", "valid_solds_percentage"]
].fillna(0)

 

Let’s go through the code. First, I merge facebook_sales and facebook_products using the right outer method.

Then I use the new column valid_promotion to find sales made under a valid promotion. In other words, find the promotion ID both in sales and promotions data.

After that, I split the output into valid and invalid sales. Both types of sales are summed and grouped by the product family.

The two DataFrames are again merged to show the valid and invalid sales by the product type. The NA values are replaced with 0.

Now that I got these values, I can find a percentage of the valid sales.

Finally, the output shows the product family, the total sales, and the valid sales percentage.

product_family total valid_solds_percentage
CONSUMABLE 103 100
GADGET 86 76.744
ACCESSORY 0 0

 

Again, here’s a video walkthrough of this solution.



 

Question #4: 3 Sum Closest

 

“Given an integer array nums of length n and an integer target, find three integers in nums such that the sum is closest to target.

Return the sum of the three integers.

You may assume that each input would have exactly one solution.”

Here’s the link to the question if you want to follow along with me.

 

Technical Concepts

 

Data analysts don’t regularly need to write algorithms. But when they do, it might be something that can help them in data analysis. This data analyst interview question is such an example because it asks you to find the closest sum to the target. This or something similar is done using Solver in Excel.

But why not get a little more sophisticated? This sophistication requires knowing these concepts:

  • Defining the Function
  • Defining the Data Type
  • Sorting Data
  • For Loops
  • range()
  • len()
  • abs()

 

Solution & Output

 

Here how you write this algorithm.

class Solution:
    def threeSumClosest(self, nums: List[int], target: int) -> int:
        diff = float("inf")
        nums.sort()
        for i in range(len(nums)):
            lo, hi = i + 1, len(nums) - 1
            while lo < hi:
                sum = nums[i] + nums[lo] + nums[hi]
                if abs(target - sum) < abs(diff):
                    diff = target - sum
                if sum < target:
                    lo += 1
                else:
                    hi -= 1
            if diff == 0:
                break
        return target - diff

 

First, define the function threeSumClosest. The input, output, and difference data have to be integers, so define them as such and sort the input array.

Then create the for loop and define the current position and pointers. After that comes setting up the criteria for the loop.

When the lo pointer is below the hi pointer, the sum is their sum plus the current position.

If the absolute value of the target and the result difference is lower than the absolute difference, then set the difference to target - sum.

If the result is below the target, increase the lo pointer by one. If not, then decrease the hi pointer by one. If the difference is zero, end the loop and show the output, which is the target minus difference.

This is Case 1 and the algorithm output.

Input
nums = [-1,2,1,-4]
target = 1

 

Output
2

 

And for Case 2:

Input
nums = [0,0,0]
target = 1

 

Output
0

 

Summary

 

These four data analyst interview questions are only examples. They, for sure, are not the only questions you need to go through before the interview.

However, they are excellent examples of what you can expect. Also, I chose them carefully so that they cover the most SQL and Python concepts data analysts need.

The rest is on you! Practice coding in SQL and Python and solve as many actual data analyst interview questions. But also don’t forget to use other resources and practice other data analysis skills.

Coding is important, but it’s not everything.

 
 
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.