Pandas Groupby function using Python¶

This is one of the most useful analytics and preprocessing tools of Pandas. As the name Groupby suggests, it groups your data by something. Normally, you would want to group your data by categorical attributes.¶

If you are familiar with SQL queries, Pandas groupby is almost identical to SQL groupby. For both SQL queries and Pandas queries, grouping your data by itself will not have any added value or any output, unless it is accompanied by an aggregate function.¶

It replicates the Pivot Table function of MS Excel. We can replicate all the functionalities of excel pivot table in python using the groupby function.¶

Categorizing a dataset and applying a function to each group, whether an aggregation or transformation, can be a critical component of a data analysis workflow. After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a versatile groupby interface, enabling you to slice, dice, and summarize datasets in a natural way. In this article you wll learn the following:¶

Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)¶

Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function¶

Compute pivot tables and cross-tabulations¶

Perform statistical analysis and other group analyses¶

Import the packages¶

In [1]:

import pandas as pd
import numpy as np
import os
os.chdir("C:\\AnalyticsEducator\\Python Free Course")

Import the dataset¶

In [2]:

data = pd.read_csv("Pivot_data.csv")
data.head(3)

Out[2]:

	Country	Category	Department	Price	Revenue
0	Netherlands	Furniture	Dining	499.0	649199.0
1	Netherlands	Furniture	Dining	699.0	638187.0
2	Netherlands	Furniture	Dining	549.0	632997.0

If you want to count the number of rows per Country, you can use the groupby function with the following code:¶

In [3]:

data.groupby("Country").size()

Out[3]:

Country
India          100
Japan          100
Netherlands    100
dtype: int64

You can group the DataFrame by more than one column as needed. To do so, you will have to introduce the columns you are grouping the DataFrame by in the form of a list of column names.¶

Size is the only function which works without mentioning any variable name to be used for value area.¶

In [17]:

data.groupby(["Country","Category"]).size()

Out[17]:

Country      Category 
India        Furniture    21
             Home         26
             Textiles     53
Japan        Furniture    14
             Home         16
             Lighting      8
             Textiles     62
Netherlands  Furniture    54
             Home          6
             Textiles     40
dtype: int64

If we want to know the average revenue by Country and Category then we will use the following code:¶

In [43]:

data.groupby(["Country","Category"]).Revenue.mean()

Out[43]:

Country      Category 
India        Furniture     45829.000000
             Home          36411.488077
             Textiles      40760.823585
Japan        Furniture     15987.214286
             Home          15259.449375
             Lighting      14896.080000
             Textiles      15727.381452
Netherlands  Furniture    201314.185185
             Home         168940.676667
             Textiles     117717.962750
Name: Revenue, dtype: float64

We could have also used the following functions, in place of average (mean)¶

count()    --> Number of non null values
sum()      --> Sum of values
mean()     --> Average of values
median()   --> Median of values
mode()     --> Mode of values
mad()      --> Mean Absolute Deviation
std()      --> Standard Deviation
var()      --> Variance
min()      --> Minimum
max()      --> Maximum
describe() --> Different descriptive statistics together

We may use the agg function (alias of aggregate) to show different statistics with groupby function¶

Here we are calculating the mean, standard deviation, sum, count, minimum, maximum and median for Revenue, grouped by first Country and then by Category¶

In [62]:

data.groupby(["Country","Category"]).Revenue.agg(["mean","std","sum","count","min","max","median"])

Out[62]:

		mean	std	sum	count	min	max	median
Country	Category
India	Furniture	45829.000000	11459.290999	962409.00	21	29952.00	61035.00	45760.000
	Home	36411.488077	8417.442794	946698.69	26	26632.80	62038.60	33744.405
	Textiles	40760.823585	11411.131060	2160323.65	53	26595.00	64841.00	37228.950
Japan	Furniture	15987.214286	7799.005333	223821.00	14	1280.00	25948.00	15982.000
	Home	15259.449375	6067.927488	244151.19	16	6494.50	24675.00	15018.415
	Lighting	14896.080000	5452.128320	119168.64	8	6987.67	21805.00	15406.625
	Textiles	15727.381452	5825.758681	975097.65	62	2818.59	25991.35	16785.190
Netherlands	Furniture	201314.185185	156258.926896	10870966.00	54	68229.00	649199.00	155163.000
	Home	168940.676667	89720.689814	1013644.06	6	66547.77	319304.00	155192.000
	Textiles	117717.962750	68257.960203	4708718.51	40	65484.45	431584.10	101217.425

We may use the agg function to create different aggregations as per separate variables¶

In the below code, we are calculating mean of Revenue, and sum & maximum of Price for different Countries¶

In [59]:

data.groupby(["Country"]).agg({"Revenue":"mean","Price":["sum","max"]})

Out[59]:

	Country	Revenue	Price
		mean	sum	max
0	India	40694.3134	6899.98	529.0
1	Japan	15622.3848	5766.55	699.0
2	Netherlands	165933.2857	28428.14	899.0

We may use the unstack function to put the inner index (2nd column from the left of the previous output i.e. Category will be put into column area¶

In [21]:

data.groupby(["Country","Category"]).Revenue.mean().unstack()

Out[21]:

Category	Furniture	Home	Lighting	Textiles
Country
India	45829.000000	36411.488077	NaN	40760.823585
Japan	15987.214286	15259.449375	14896.08	15727.381452
Netherlands	201314.185185	168940.676667	NaN	117717.962750

We may also put the name of the column/columns in form of a list (by putting the different column names into square bracket) within unstack function to put it into the column area¶

In [36]:

data.groupby(["Country","Category"]).Revenue.mean().unstack("Country")

Out[36]:

Country	India	Japan	Netherlands
Category
Furniture	45829.000000	15987.214286	201314.185185
Home	36411.488077	15259.449375	168940.676667
Lighting	NaN	14896.080000	NaN
Textiles	40760.823585	15727.381452	117717.962750

Suppose we have put both Country and Category into the column area and save the result into a data frame¶

In [41]:

Analytics_Educator = data.groupby(["Country","Category","Department"]).Revenue.sum().unstack(["Country","Category"])
Analytics_Educator

Out[41]:

Country	India			Japan				Netherlands
Category	Furniture	Home	Textiles	Furniture	Home	Lighting	Textiles	Furniture	Home	Textiles
Department
Bath	NaN	129455.13	328633.10	NaN	38014.64	NaN	344259.95	NaN	NaN	719079.49
Bedroom	82495.0	NaN	869674.70	72535.0	NaN	NaN	230126.56	803198.0	126684.00	3647101.85
Dining	73210.0	NaN	272442.00	NaN	30796.50	NaN	163830.19	3960201.0	NaN	81455.25
Entryways	52500.0	234005.00	NaN	NaN	40787.50	NaN	NaN	553007.0	NaN	NaN
Kids Room	223299.0	NaN	404392.09	47285.0	NaN	30813.25	86940.00	1327027.0	NaN	261081.92
Kitchen	NaN	583238.56	285181.76	NaN	134552.55	NaN	149940.95	NaN	886960.06	NaN
Living Room	530905.0	NaN	NaN	104001.0	NaN	88355.39	NaN	4227533.0	NaN	NaN

Now we may use the stack function, which is just the opposite of unstack function, to push the variables from the column area back into row area¶

In [42]:

Analytics_Educator.stack(["Country","Category"])

Out[42]:

Department   Country      Category 
Bath         India        Home          129455.13
                          Textiles      328633.10
             Japan        Home           38014.64
                          Textiles      344259.95
             Netherlands  Textiles      719079.49
Bedroom      India        Furniture      82495.00
                          Textiles      869674.70
             Japan        Furniture      72535.00
                          Textiles      230126.56
             Netherlands  Furniture     803198.00
                          Home          126684.00
                          Textiles     3647101.85
Dining       India        Furniture      73210.00
                          Textiles      272442.00
             Japan        Home           30796.50
                          Textiles      163830.19
             Netherlands  Furniture    3960201.00
                          Textiles       81455.25
Entryways    India        Furniture      52500.00
                          Home          234005.00
             Japan        Home           40787.50
             Netherlands  Furniture     553007.00
Kids Room    India        Furniture     223299.00
                          Textiles      404392.09
             Japan        Furniture      47285.00
                          Lighting       30813.25
                          Textiles       86940.00
             Netherlands  Furniture    1327027.00
                          Textiles      261081.92
Kitchen      India        Home          583238.56
                          Textiles      285181.76
             Japan        Home          134552.55
                          Textiles      149940.95
             Netherlands  Home          886960.06
Living Room  India        Furniture     530905.00
             Japan        Furniture     104001.00
                          Lighting       88355.39
             Netherlands  Furniture    4227533.00
dtype: float64

Analytics Educator is the best institute for Data Science courses, based out of Kolkata. We provide training to all aspirants, who are willing to learn Data Science and get a job, across the globe. We specialize in providing training on data science even to students coming from non-technical background with zero programming or statistical knowledge. We help the associates to learn data science and get job in this field.¶

We provide a 100% money back guarantee on learning. It means that each and every student of analytics educator will be able to understand every line of codes and algorithm, else we will refund the money back.¶

Pandas Groupby function using Python¶

This is one of the most useful analytics and preprocessing tools of Pandas. As the name Groupby suggests, it groups your data by something. Normally, you would want to group your data by categorical attributes.¶

If you are familiar with SQL queries, Pandas groupby is almost identical to SQL groupby. For both SQL queries and Pandas queries, grouping your data by itself will not have any added value or any output, unless it is accompanied by an aggregate function.¶

It replicates the Pivot Table function of MS Excel. We can replicate all the functionalities of excel pivot table in python using the groupby function.¶

Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)¶

Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function¶

Compute pivot tables and cross-tabulations¶

Perform statistical analysis and other group analyses¶

Import the packages¶

Import the dataset¶

If you want to count the number of rows per Country, you can use the groupby function with the following code:¶

You can group the DataFrame by more than one column as needed. To do so, you will have to introduce the columns you are grouping the DataFrame by in the form of a list of column names.¶

Size is the only function which works without mentioning any variable name to be used for value area.¶

If we want to know the average revenue by Country and Category then we will use the following code:¶

We could have also used the following functions, in place of average (mean)¶

We may use the agg function (alias of aggregate) to show different statistics with groupby function¶

Here we are calculating the mean, standard deviation, sum, count, minimum, maximum and median for Revenue, grouped by first Country and then by Category¶

We may use the agg function to create different aggregations as per separate variables¶

In the below code, we are calculating mean of Revenue, and sum & maximum of Price for different Countries¶

We may use the unstack function to put the inner index (2nd column from the left of the previous output i.e. Category will be put into column area¶

We may also put the name of the column/columns in form of a list (by putting the different column names into square bracket) within unstack function to put it into the column area¶

Suppose we have put both Country and Category into the column area and save the result into a data frame¶

Now we may use the stack function, which is just the opposite of unstack function, to push the variables from the column area back into row area¶

We provide a 100% money back guarantee on learning. It means that each and every student of analytics educator will be able to understand every line of codes and algorithm, else we will refund the money back.¶

You may check out all our instructor led courses from this link. Analytics Courses ¶

If you want to learn real world business project on Machine Learning case studies then you may go through the following links -¶

Titanic Survival Case Study ¶

Cancerl Detection using Machine Learning ¶

T-Shirt Size Prediction ¶

FREE Python visualization tutorial ¶

Contact us for any feedback/ inquiry at Contact us ¶

Pandas Groupby function using Python¶

This is one of the most useful analytics and preprocessing tools of Pandas. As the name Groupby suggests, it groups your data by something. Normally, you would want to group your data by categorical attributes.¶

If you are familiar with SQL queries, Pandas groupby is almost identical to SQL groupby. For both SQL queries and Pandas queries, grouping your data by itself will not have any added value or any output, unless it is accompanied by an aggregate function.¶

It replicates the Pivot Table function of MS Excel. We can replicate all the functionalities of excel pivot table in python using the groupby function.¶

Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)¶

Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function¶

Compute pivot tables and cross-tabulations¶

Perform statistical analysis and other group analyses¶

Import the packages¶

Import the dataset¶

If you want to count the number of rows per Country, you can use the groupby function with the following code:¶

You can group the DataFrame by more than one column as needed. To do so, you will have to introduce the columns you are grouping the DataFrame by in the form of a list of column names.¶

Size is the only function which works without mentioning any variable name to be used for value area.¶

If we want to know the average revenue by Country and Category then we will use the following code:¶

We could have also used the following functions, in place of average (mean)¶

We may use the agg function (alias of aggregate) to show different statistics with groupby function¶

Here we are calculating the mean, standard deviation, sum, count, minimum, maximum and median for Revenue, grouped by first Country and then by Category¶

We may use the agg function to create different aggregations as per separate variables¶

In the below code, we are calculating mean of Revenue, and sum & maximum of Price for different Countries¶

We may use the unstack function to put the inner index (2nd column from the left of the previous output i.e. Category will be put into column area¶

We may also put the name of the column/columns in form of a list (by putting the different column names into square bracket) within unstack function to put it into the column area¶

Suppose we have put both Country and Category into the column area and save the result into a data frame¶

Now we may use the stack function, which is just the opposite of unstack function, to push the variables from the column area back into row area¶

We provide a 100% money back guarantee on learning. It means that each and every student of analytics educator will be able to understand every line of codes and algorithm, else we will refund the money back.¶

You may check out all our instructor led courses from this link. Analytics Courses¶

If you want to learn real world business project on Machine Learning case studies then you may go through the following links -¶

Titanic Survival Case Study¶

Cancerl Detection using Machine Learning¶

T-Shirt Size Prediction¶

FREE Python visualization tutorial¶

Contact us for any feedback/ inquiry at Contact us¶

You may check out all our instructor led courses from this link. Analytics Courses ¶

Titanic Survival Case Study ¶

Cancerl Detection using Machine Learning ¶

T-Shirt Size Prediction ¶

FREE Python visualization tutorial ¶

Contact us for any feedback/ inquiry at Contact us ¶