Statistical Functions in Python

In this tutorial, we cover some useful statistical functions that can be applied to pandas Series and DataFrame objects.




 

Statistical functions help in analyzing data and drawing meaningful conclusions from it. The functions covered here can be applied to pandas Series and DataFrame objects.

The following statistical functions are covered in this tutorial:

  • pct_change()
  • cov()
  • corr()
  • corrwith()

 

pct_change()

 

The pct_change() method can be applied to a pandas Series or DataFrame to calculate the percent change over a given number of periods. The result is expressed as a fraction, so 0.5 means a 50% increase.

 

Calculating pct_change() without specifying the number of periods

 

Code:

import pandas as pd
import numpy as np

series = pd.Series(np.random.randn(10))

series.pct_change()


Output:

0         NaN
1   -0.881470
2   -5.025007
3    0.728078
4   -0.577371
5    1.173420
6   -1.578389
7   -3.520208
8   -1.927874
9   -1.600583
dtype: float64
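Because the random values above change on every run, a small fixed series makes the arithmetic easier to check. The sketch below uses made-up numbers (the name `prices` is only illustrative) and compares pct_change() with the calculation written out by hand: (current - previous) / previous.

```python
import pandas as pd

# Made-up values chosen so the arithmetic is easy to verify by hand
prices = pd.Series([100.0, 110.0, 99.0, 148.5])

# Default behaviour: compare each value with the one just before it
change = prices.pct_change()

# The same calculation written out explicitly
previous = prices.shift(1)
manual = (prices - previous) / previous

print(change)
print(manual)
```

The first entry is NaN because there is no earlier value to compare with. The remaining entries are fractions: roughly 0.1, -0.1 and 0.5, that is, changes of +10%, -10% and +50%.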


Calculating pct_change() by specifying the number of periods

 

Code:

df = pd.DataFrame(np.random.randn(10,2))

df.pct_change(periods = 2)


Output (first five rows):

          0         1
0       NaN       NaN
1       NaN       NaN
2 -0.095052 -1.399525
3  0.073909 -7.491512
4 -0.882174 -1.150202
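With periods = 2, each value is compared with the value two rows earlier, which is why the first two rows are NaN. A minimal sketch with made-up numbers (the variable names are illustrative) shows the equivalent calculation using shift():

```python
import pandas as pd

# Made-up values; any numeric Series would work the same way
values = pd.Series([10.0, 20.0, 30.0, 50.0, 60.0])

# Change relative to the value two rows earlier
two_step = values.pct_change(periods=2)

# Equivalent calculation: divide by the value shifted down two rows
manual = values / values.shift(2) - 1

print(two_step)
print(manual)
```

Here row 2 is compared with row 0 (30 against 10), row 3 with row 1 (50 against 20), and row 4 with row 2 (60 against 30).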

 

Covariance: cov()

 

The cov() method calculates covariance. Called on a Series, it returns the covariance with another Series. Called on a DataFrame, it returns the pairwise covariance between the columns of that DataFrame.

In both cases, missing values are excluded from the calculation.

 

Calculating covariance between two series

 

Code:

series1 = pd.Series(np.random.randn(200))
series2 = pd.Series(np.random.randn(200))

series1.cov(series2)


Output:

-0.14817157321848334
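To see how missing values are handled, the following sketch uses two short made-up series that each contain one NaN. It compares the result of cov() with a manual sample covariance computed only on the rows where both values are present. The names `s1`, `s2` and `both` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up data: each series has one missing value, in a different row
s1 = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
s2 = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

cov_pandas = s1.cov(s2)

# Manual check: keep only the rows where both values are present
both = pd.concat([s1, s2], axis=1, keys=["s1", "s2"]).dropna()
x, y = both["s1"], both["s2"]

# Sample covariance (divides by n - 1)
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(both) - 1)

print(cov_pandas, cov_manual)
```

Only the first three rows are complete, so both calculations use those three pairs and agree with each other.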


Calculating the pairwise covariance of DataFrame columns

 

Code:

df = pd.DataFrame(np.random.randn(4,5),columns = ["a","b","c","d","e"])
df.cov()


Output:

          a         b         c         d         e
a  2.095402  0.191502  0.049185  0.090229 -1.052856
b  0.191502  0.628889  0.377184 -0.507893  0.404180
c  0.049185  0.377184  0.336220 -0.077814  0.571139
d  0.090229 -0.507893 -0.077814  0.950198  0.164894
e -1.052856  0.404180  0.571139  0.164894  1.722546
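The covariance matrix has a simple structure: it is symmetric, its diagonal holds the variance of each column, and each off-diagonal entry matches the covariance of the corresponding pair of columns. The sketch below checks this on random data. The fixed seed and the variable names are assumptions made for the example only.

```python
import numpy as np
import pandas as pd

# Fixed seed so the example is repeatable
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 3)), columns=["a", "b", "c"])

cov_matrix = df.cov()

# The diagonal of the covariance matrix is the variance of each column
diagonal = pd.Series(np.diag(cov_matrix.to_numpy()), index=cov_matrix.index)
print(diagonal)
print(df.var())

# An off-diagonal entry equals the covariance of the two columns
print(cov_matrix.loc["a", "b"], df["a"].cov(df["b"]))
```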

 

 

Correlation: corr()

 

Correlation is computed with the corr() method. Its method parameter accepts the following options:

  1. "pearson" (default): the standard correlation coefficient
  2. "kendall": the Kendall Tau correlation coefficient
  3. "spearman": the Spearman rank correlation coefficient

 

Calculating the correlation between two columns of a DataFrame using the default Pearson method

 

Code:

df = pd.DataFrame(np.random.randn(200,4), columns = ["a","b","c","d"])
df["a"].corr(df["b"])


Output:

0.08425780768544051


Calculating the correlation between two columns of a DataFrame using the Spearman method

 

Code:

df["a"].corr(df["b"], method = "spearman")


Output:

0.053819845496137414
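The difference between the Pearson and Spearman methods is easiest to see on data that is monotonic but not linear. Spearman correlation is the Pearson correlation of the ranks, so the sketch below computes it that way, as an illustration rather than a replacement for method = "spearman". The data and names are made up.

```python
import pandas as pd

# Made-up data: y always increases with x, but not in a straight line
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Pearson measures linear association, so it falls short of 1 here
pearson = x.corr(y)

# Spearman computed by hand: Pearson correlation of the ranks
spearman_by_ranks = x.rank().corr(y.rank())

print(pearson)
print(spearman_by_ranks)
```

Because the ranks of x and y are identical, the rank-based value is 1 (up to floating-point rounding), while the Pearson value is lower. Calling x.corr(y, method = "spearman") should give the same rank-based result.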


Calculating the pairwise correlation between DataFrame columns

 

Code:

df.corr()


Output:

          a         b         c         d
a  1.000000  0.084258 -0.074284  0.054453
b  0.084258  1.000000  0.022995  0.029727
c -0.074284  0.022995  1.000000 -0.028279
d  0.054453  0.029727 -0.028279  1.000000
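The Pearson correlation matrix is the covariance matrix rescaled by the standard deviations of the columns, which is why its diagonal is always 1. A minimal sketch of that relationship, assuming random data with a fixed seed and no missing values:

```python
import numpy as np
import pandas as pd

# Fixed seed so the example is repeatable
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=["a", "b", "c"])

cov = df.cov()
std = df.std().to_numpy()

# corr(x, y) = cov(x, y) / (std(x) * std(y)), applied to every pair of columns
manual_corr = cov / np.outer(std, std)

print(manual_corr)
print(df.corr())
```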

 

corrwith()

 

The corrwith() method is applied to a DataFrame to calculate the correlation between identically labeled Series in two different DataFrame objects. By default, the columns of the two objects are matched by name.

Code:

index = ["a","b","c","d","e"]

columns = ["one","two","three","four"]

df1 = pd.DataFrame(np.random.randn(5,4), index = index, columns = columns )

df2 = pd.DataFrame(np.random.randn(4,4), index = index[:4], columns = columns)

df1.corrwith(df2)


Output:

one      0.277569
two     -0.052151
three   -0.754392
four     0.526614
dtype: float64


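Passing axis=1 matches rows instead of columns. Row e exists only in df1, so its correlation is NaN.

Code: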

df2.corrwith(df1, axis=1)


Output:

a    0.346955
b   -0.707590
c    0.711081
d    0.753457
e         NaN
dtype: float64
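To make the label matching concrete, the sketch below uses two small made-up DataFrame objects (`left` and `right` are illustrative names). It checks that the column-wise result of corrwith() agrees with calling corr() on each pair of same-named columns, and that a row label missing from one object yields NaN when axis=1 is used.

```python
import pandas as pd

index = ["a", "b", "c", "d", "e"]

# Made-up data; "right" has no row labeled "e"
left = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0, 5.0],
     "two": [5.0, 3.0, 4.0, 1.0, 2.0],
     "three": [2.0, 2.0, 5.0, 3.0, 1.0]},
    index=index,
)
right = pd.DataFrame(
    {"one": [2.0, 4.0, 6.0, 8.0],
     "two": [1.0, 2.0, 3.0, 4.0],
     "three": [3.0, 1.0, 2.0, 5.0]},
    index=index[:4],
)

# Column-wise: each column of "left" against the same-named column of "right"
by_column = left.corrwith(right)

# The same result, one column pair at a time
per_column = pd.Series({col: left[col].corr(right[col]) for col in left.columns})

# Row-wise: each row of "left" against the same-labeled row of "right"
by_row = left.corrwith(right, axis=1)

print(by_column)
print(by_row)
```

Only the rows shared by both objects (a to d) enter the column-wise calculation. In the row-wise result, the label e has no partner in `right`, so it appears with a NaN value.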


 
Priya Sengar (Medium, Github) is a Data Scientist with Old Dominion University. Priya is passionate about solving problems in data and converting them into solutions.