Spilling the Beans on Visualizing Distribution

Priyanka Ks 10 Dec, 2020 • 5 min read

This article was published as a part of the Data Science Blogathon.


Introduction to distribution and drawing a visual reference

The English dictionary defines distribution as the way in which something is spread over an area. The good part about definitions, in general, is that they draw their strength from language and word meanings. If you can identify the statistically significant keywords hiding in definitions, write about how you understand them in the comments section.

 

Definitions in Statistics

Distribution is defined as the possible values a variable can take and how often they occur [1].

The word with statistical significance hiding in that definition is variable.

A variable has two aspects [2]:

  • A variable is an attribute that describes a person, place, thing, or idea.
  • The value of the variable can “vary” from one entity to another.

Distribution is like finding patterns in spilled beans. This article drafts a visual description of that idea and overlays statistics on top of it.

Distribution is like beans spilled from a bag, quite literally. Step back and imagine a dataset full of numbers. If each number represented a bean, numeric data would be like beans spilled from that bag. Another parallel comes from my days in coffee trading. The bags of beans that came from the estate held beans of all sizes, and the first effort was always to sort them, first by shape and then by size.

Datum: in statistics, a datum is one item of information, one fact, one statistic. Datum is the singular form of the more commonly used "data"; in plain English, it is a single piece of information.

A datum is a lot like a spilled bean: its value can be large or small. Sorting is any process of arranging items systematically, and systematic stems from the word system. The column header and the row index create that system. The intersection of a row and a column gives meaning to what is otherwise just a number. Visually, the spreadsheet is ready.

Conventional primary data collection starts by defining the column headers; the row index is generated as the data points are collected. The transpose of that is also valid, but the process is the same. The spreadsheet we constructed in the previous paragraph is a corollary of conventional primary data collection methods: the end state of both approaches is the standard spreadsheet of rows, columns, and numbers. Smaller collections of data can be managed with the familiar Excel sheets and .csv files; larger collections bring in the concepts of Big Data.
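As a rough sketch of that end state, the snippet below builds the bean "spreadsheet" with pandas; the column names and the sizes are invented purely for illustration.

```python
# A minimal sketch of the spreadsheet described above, using pandas.
# The column names and bean sizes are made up for illustration only.
import pandas as pd

# Column headers define the "system"; the row index is generated as data arrives.
beans = pd.DataFrame(
    {"bean_id": [1, 2, 3, 4], "size_mm": [6.2, 7.1, 5.8, 6.9]}
)

print(beans)                     # rows, columns, and numbers: the standard spreadsheet
print(beans.loc[2, "size_mm"])   # one datum, at the intersection of a row and a column

# Smaller collections are comfortably handled as .csv files.
beans.to_csv("beans.csv", index=False)
```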

 

Analysis is an Interplay

Once the data set is available, phase 2 begins. Analyzing data is an exercise in dealing with the interplay between numbers: tying them back to the headers, building a narrative, and visualizing the result with tools and libraries.

Data analysis is a method in which data is collected and organized so that helpful information can be derived from it. Remember that each datum sits in a row and a column; more precisely, a datum sits at the intersection of a row and a column.

 

As an analyst, what do you do?

You can compare the values within a single column. Call it univariate analysis: all comparisons and conclusions are drawn from what lies within one column, uni meaning one, a single column heading to which every datum refers. Similarly, a comparison of two columns is called bivariate, and comparisons across more than two columns are called multivariate.
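Below is a minimal, hypothetical illustration of the three terms using pandas; the columns size_mm, shape_score, and weight_g are invented for the example.

```python
# A rough illustration of the uni/bi/multivariate split on a toy "beans" table.
import pandas as pd

beans = pd.DataFrame({
    "size_mm":     [6.2, 7.1, 5.8, 6.9, 6.4],
    "shape_score": [0.80, 0.90, 0.70, 0.85, 0.75],
    "weight_g":    [0.14, 0.18, 0.12, 0.17, 0.15],
})

# Univariate: every comparison stays inside one column.
print(beans["size_mm"].describe())

# Bivariate: two columns compared against each other.
print(beans["size_mm"].corr(beans["shape_score"]))

# Multivariate: more than two columns considered together.
print(beans[["size_mm", "shape_score", "weight_g"]].corr())
```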

Giving a column a name and collecting the data points below it is one of the most popular methods of recording, probably rooted in how the human eye moves across a flat surface when reading. Compare heat maps of reading patterns: two of the four most popular reading patterns show a higher concentration along the column format. Regardless of the arrangement, the story is no different if the analysis is done row-wise; anything tied to a single header would still be called univariate, and so on.

 

Let’s overlay statistics on a bag of spilled beans: 

The purpose of the effort is to describe and summarize the data and find patterns in it. Some ways to describe patterns found in univariate data include measures of central tendency (mean, mode, and median) and of dispersion: range, variance, maximum, minimum, quartiles (including the interquartile range), and standard deviation [3]. Visualization tools include frequency tables, bar graphs, pie charts, histograms, box plots, and frequency polygons.
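As a hedged sketch of those summaries, the code below computes them on made-up bean sizes with NumPy and Matplotlib; the numbers are illustrative, not real measurements.

```python
# Univariate summaries and two of the plots listed above, on invented bean sizes.
import numpy as np
import matplotlib.pyplot as plt
from statistics import mode

sizes = np.array([5.8, 6.1, 6.2, 6.4, 6.4, 6.7, 6.9, 7.1, 7.3, 8.0])

# Central tendency
print("mean:", sizes.mean())
print("median:", np.median(sizes))
print("mode:", mode(sizes.tolist()))

# Dispersion
print("range:", sizes.max() - sizes.min())
print("variance:", sizes.var(ddof=1))
print("std dev:", sizes.std(ddof=1))
q1, q3 = np.percentile(sizes, [25, 75])
print("interquartile range:", q3 - q1)

# Histogram and box plot of the same column
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(sizes, bins=5)
ax1.set_title("Histogram of bean sizes")
ax2.boxplot(sizes)
ax2.set_title("Box plot of bean sizes")
plt.show()
```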

Coffee beans compared with other coffee beans for size is a univariate analysis. Coffee beans compared with coffee beans for both size and shape is a bivariate analysis. A coffee bean and a rice grain compared for size is also a bivariate analysis.

Interestingly, statistics says that a random spread of anything will have a pattern. If it were coffee beans spilled all over, you could see it and tell. Data is a little harder to see and tell; you need to rely on mathematical theory and machine learning models to figure it out. It is something like spilling the beans, making a mess, and vanishing from the scene, leaving someone else who does not know what happened to figure things out.

The one who tries to figure it out is the analyst. Most answers will come in the form of "this was probably that." To make sense of the data we now rely extensively on machine learning models, which in turn rest on the theory of probability; they either cluster or classify. If you are a student of machine learning (ML), head over to the comments section and share your definitions of the hidden keywords.

The theory of the normal distribution has some interesting insights for any bag of beans that got spilled. The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean: data near the mean are more frequent in occurrence than data far from the mean, and in graph form the normal distribution appears as a bell curve [4]. If you know the average size of a coffee bean, the normal distribution will help you work out (approximately) what percentage of the beans will be of what size. Cool, eh!
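To see that claim in code, the sketch below simulates bean sizes from an assumed Gaussian (mean 6.5 mm, standard deviation 0.4 mm, both invented) and checks what fraction of beans land near the mean.

```python
# Simulating the normal-distribution claim: most beans cluster around the mean.
import numpy as np

rng = np.random.default_rng(42)
sizes = rng.normal(loc=6.5, scale=0.4, size=100_000)  # assumed mean 6.5 mm, std 0.4 mm

within_1sd = np.mean(np.abs(sizes - 6.5) <= 0.4)
within_2sd = np.mean(np.abs(sizes - 6.5) <= 0.8)

print(f"within 1 std dev: {within_1sd:.1%}")  # roughly 68%
print(f"within 2 std dev: {within_2sd:.1%}")  # roughly 95%
```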

Let's return to questions of size. For the humble coffee bean it is called size; for numeric data it is the value. For the sake of simplicity, let's say every bean that spilled from the bag has a positive value. Normalization is a necessary consideration when attempting bivariate analysis. Say the bag held both coffee beans and rice. Normalization is a technique for bringing every datum onto a common scale: it is like applying a mathematical formula so that coffee beans and rice grains, very different from each other and with very different ranges of size among other things, can still be compared. Normalize the columns that hold the sizes of both rice and coffee, and normalization becomes the magic wand that lets you add rice and coffee beans together and still make sense of the exercise.
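Here is a minimal sketch of min-max normalization, assuming invented coffee-bean and rice-grain measurements in millimetres.

```python
# Min-max normalization: rescale each column to a common 0-1 range.
import pandas as pd

df = pd.DataFrame({
    "coffee_size_mm": [5.8, 6.2, 6.9, 7.4],
    "rice_length_mm": [4.9, 5.6, 6.1, 7.0],
})

# (x - min) / (max - min) brings both columns onto the same scale.
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)
```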

 

End Notes

It is anomalies like these, where we mix coffee with rice, that make data science interesting and complex. Intelligence lies in what can be connected, and the tools are there to help.

So enjoy figuring out how to add rice and coffee. There is a whole host of Python libraries and exploratory data analysis techniques to figure out what the spill is all about.

The author of this article encourages you to share your own definitions of the statistically significant keywords. Head over to the comments section.


[1] https://365datascience.com/explainer-video/distribution-in-statistics/
[2] https://stattrek.com/statistics/dictionary.aspx?definition=variable
[3] https://www.statisticshowto.com/univariate/#:~:text=Univariate%20analysis%20is%20the%20simplest,finds%20patterns%20in%20the%20data.
[4] https://www.investopedia.com/terms/n/normaldistribution.asp