Query Your Pandas DataFrames with SQL

Learn how to query your Pandas DataFrames using the standard SQL SELECT statement, seamlessly from within your Python code.



Query Your Pandas DataFrames with SQL
Photo by Jeffrey Czum from Pexels (edits by author)

 

Pandas — or, more specifically, its primary data container, the DataFrame — has long ago solidified itself as the standard tabular data storage structure in the Python data ecosystem. Using the Pandas DataFrame comes with its own specifications for accessing, manipulating, and performing computations on composite data, specifications which can be mastered with time and persistence thanks to its conformity to Python syntax; if you are intimately familiar with Python, you are already well on your way to being proficient with the standard Pandas API.

There are a lot of users who are not so proficient with Python, however, who instead have made the jump to the language thanks to its prominent position in data science and data analysis, as well as related fields and professions. Many of those making the jump may also already be familiar with another of the heavyweight data science languages: SQL. So why can we not query data in our Pandas DataFrames using SQL, a language written specifically with tabular data access in mind?

Enter Fugue. Fugue is a project created with the decoupling of logic and execution in mind, primarily with the goal of simplifying computational parallelization.

 

Fugue is an abstraction framework that lets users write code in native Python or Pandas, and then port it over to Spark and Dask.

 

A core component of this effort is FugueSQL. FugueSQL is not pure SQL; it describes its syntax as a mix "between standard SQL, json and python." However, you should find that for basic querying, it operates more or less as you would expect. Specifically, it is designed "[t]o be fully compatible with standard SQL SELECT statement." As such, it is generally more straightforward for newcomers to data munging in Pandas than is querying Pandas in Python.

The benefits of FugueSQL can be realized in parallel code, alongside the additional goals of the Fugue project, or standalone, querying your DataFrames on a lone computer. In this article we will take a look at how to get up and running with FugureSQL in your own work right away, and examples of how to do so.

 

Installation and Preparation

 
If you want to install the full Fugue library to take advantage of all of its parallelization offerings, you can install as such:

pip install fugue


If, instead, you only want the FugueSQL API — which is all that is required for the examples in this article — install as such:

pip install fugue[sql]


Once installed, import the required libraries:

import pandas as pd
from fugue_sql import fsql


With that, all that is left is to create a DataFrame we can use to query in our examples. Since it is up to me to do so — and since I'm currently knee-deep in organizing my comic book collection — let's go ahead and create a simple toy comic book dataset.

The simple DataFrame will consist of 4 columns: a particular issue of a comic book title; the comic book publisher; the professionally- and independently-assigned comic book grade (on a 10 point grading scale); the value of the comic book in its assigned grade.

Here is the code to accomplish what is laid out above:

comics_df = pd.DataFrame({'book': ['Secret Wars 8',
                                   'Tomb of Dracula 10',
                                   'Amazing Spider-Man 252',
                                   'New Mutants 98',
                                   'Eternals 1',
                                   'Amazing Spider-Man 300',
                                   'Department of Truth 1'],
                          'publisher': ['Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Image'],
                          'grade': [9.6, 5.0, 7.5, 8.0, 9.2, 6.5, 9.8],
                          'value': [400, 2500, 300, 600, 400, 750, 175]})

print(comics_df)


                     book publisher  grade  value
0           Secret Wars 8    Marvel    9.6    400
1      Tomb of Dracula 10    Marvel    5.0   2500
2  Amazing Spider-Man 252    Marvel    7.5    300
3          New Mutants 98    Marvel    8.0    600
4              Eternals 1    Marvel    9.2    400
5  Amazing Spider-Man 300    Marvel    6.5    750
6   Department of Truth 1     Image    9.8    175


You don't need to understand (or care) anything more about comic book collecting to make sense of the following examples, so let's get into them.

 

Example 1: Which of my books are graded above 8.0?

 
The section title says it all. We are going to use FugueSQL and the standard SELECT statement to query our Pandas DataFrame to determine which of my comic books are graded above 8.0 on a 10 point scale.

To do this, we first need to define the SQL statement:

# which of my books are graded above 8.0?
query_1 = """
SELECT book, publisher, grade, value FROM comics_df
WHERE grade > 8.0
PRINT
"""


Note the use of the Pandas DataFrame as the table name in our SELECT statement.

Once the query is defined, it must be executed using the Fugue query engine:

fsql(query_1).run()


And here is our returned query:

PandasDataFrame
book:str                                                      |publisher:str|grade:double|value:long
--------------------------------------------------------------+-------------+------------+----------
Secret Wars 8                                                 |Marvel       |9.6         |400       
Eternals 1                                                    |Marvel       |9.2         |400       
Department of Truth 1                                         |Image        |9.8         |175       
Total count: 3


There are 3 books in our DataFrame graded above 8.0: Secret Wars 8; Eternals 1; and Department of Truth 1.

And if you compare this result to our original DataFrame, you will find that our query has been returned appropriately. It worked! Very cool, and, if you are familiar with SQL syntax, very straightforward.

 

Example 2: Which of my books are valued above 500?

 
Let's look at a second example, where we will find which of these books are valued above $500, given their grade and the current market at this article's publication time.

Again, we will first define the SQL statement, and then execute it using the FugueSQL engine.

# which of my books are valued above 500?
query_2 = """
SELECT book, publisher, grade, value FROM comics_df
WHERE value > 500
PRINT
"""

fsql(query_2).run()


And the results:

PandasDataFrame
book:str                                                      |publisher:str|grade:double|value:long
--------------------------------------------------------------+-------------+------------+----------
Tomb of Dracula 10                                            |Marvel       |5.0         |2500      
New Mutants 98                                                |Marvel       |8.0         |600       
Amazing Spider-Man 300                                        |Marvel       |6.5         |750       
Total count: 3


Of the books in our DataFrame, 3 are valued above $500. Each of these happen to be the first full appearance of 3 very popular characters, leading to their relatively high values: Blade the Vampire Hunter; Deadpool; and Venom, respectively.

 

Example 3: Which of my books are published by Image?

 
For our last example, let's look at which of these comics were published by Image Comics.

# which of my books are published by Image?
query_3 = """
SELECT book, publisher, grade, value FROM comics_df
WHERE publisher = 'Image'
PRINT
"""

fsql(query_3).run()


PandasDataFrame
book:str                                                      |publisher:str|grade:double|value:long
--------------------------------------------------------------+-------------+------------+----------
Department of Truth 1                                         |Image        |9.8         |175       
Total count: 1


Only 1 of these fit the bill: Department of Truth 1, a masterfully-written contemporary masterpiece based on the idea that popular belief in conspiracy theories actually alters reality, is the only one of these books published by Image.

And what is the data type of the returned query result? Let's re-run the above query, store it to a variable, and inspect its type.

result = fsql(query_3).run()
print(type(result))


fugue.dataframe.dataframes.DataFrames


The returned result is in the form of a Fugue dataframe. Find out more about the Fugue dataframe here, and how to further process returned results.

 
For additional examples, see this FugueSQL tutorial.

 
 
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.