Analytics Educator
  • Home
  • Courses
  • Blog
  • FAQ
  • Contact Us
  • Home
  • Courses
  • FAQ
  • Contact
Home   /   Blog   /   Details

Pandas Groupby function using Python¶

This is one of the most useful analytics and preprocessing tools of Pandas. As the name Groupby suggests, it groups your data by something. Normally, you would want to group your data by categorical attributes.¶

If you are familiar with SQL queries, Pandas groupby is almost identical to SQL groupby. For both SQL queries and Pandas queries, grouping your data by itself will not have any added value or any output, unless it is accompanied by an aggregate function.¶

It replicates the Pivot Table function of MS Excel. We can replicate all the functionalities of excel pivot table in python using the groupby function.¶

Categorizing a dataset and applying a function to each group, whether an aggregation or transformation, can be a critical component of a data analysis workflow. After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a versatile groupby interface, enabling you to slice, dice, and summarize datasets in a natural way. In this article you wll learn the following:¶

Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)¶

Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function¶

Compute pivot tables and cross-tabulations¶

Perform statistical analysis and other group analyses¶

Import the packages¶

In [1]:
import pandas as pd
import numpy as np
import os
os.chdir("C:\\AnalyticsEducator\\Python Free Course")

Import the dataset¶

In [2]:
data = pd.read_csv("Pivot_data.csv")
data.head(3)
Out[2]:
Country Category Department Price Revenue
0 Netherlands Furniture Dining 499.0 649199.0
1 Netherlands Furniture Dining 699.0 638187.0
2 Netherlands Furniture Dining 549.0 632997.0

If you want to count the number of rows per Country, you can use the groupby function with the following code:¶

In [3]:
data.groupby("Country").size()
Out[3]:
Country
India          100
Japan          100
Netherlands    100
dtype: int64

You can group the DataFrame by more than one column as needed. To do so, you will have to introduce the columns you are grouping the DataFrame by in the form of a list of column names.¶

Size is the only function which works without mentioning any variable name to be used for value area.¶

In [17]:
data.groupby(["Country","Category"]).size()
Out[17]:
Country      Category 
India        Furniture    21
             Home         26
             Textiles     53
Japan        Furniture    14
             Home         16
             Lighting      8
             Textiles     62
Netherlands  Furniture    54
             Home          6
             Textiles     40
dtype: int64

If we want to know the average revenue by Country and Category then we will use the following code:¶

In [43]:
data.groupby(["Country","Category"]).Revenue.mean()
Out[43]:
Country      Category 
India        Furniture     45829.000000
             Home          36411.488077
             Textiles      40760.823585
Japan        Furniture     15987.214286
             Home          15259.449375
             Lighting      14896.080000
             Textiles      15727.381452
Netherlands  Furniture    201314.185185
             Home         168940.676667
             Textiles     117717.962750
Name: Revenue, dtype: float64

We could have also used the following functions, in place of average (mean)¶

count()    --> Number of non null values
sum()      --> Sum of values
mean()     --> Average of values
median()   --> Median of values
mode()     --> Mode of values
mad()      --> Mean Absolute Deviation
std()      --> Standard Deviation
var()      --> Variance
min()      --> Minimum
max()      --> Maximum
describe() --> Different descriptive statistics together

We may use the agg function (alias of aggregate) to show different statistics with groupby function¶

Here we are calculating the mean, standard deviation, sum, count, minimum, maximum and median for Revenue, grouped by first Country and then by Category¶

In [62]:
data.groupby(["Country","Category"]).Revenue.agg(["mean","std","sum","count","min","max","median"])
Out[62]:
mean std sum count min max median
Country Category
India Furniture 45829.000000 11459.290999 962409.00 21 29952.00 61035.00 45760.000
Home 36411.488077 8417.442794 946698.69 26 26632.80 62038.60 33744.405
Textiles 40760.823585 11411.131060 2160323.65 53 26595.00 64841.00 37228.950
Japan Furniture 15987.214286 7799.005333 223821.00 14 1280.00 25948.00 15982.000
Home 15259.449375 6067.927488 244151.19 16 6494.50 24675.00 15018.415
Lighting 14896.080000 5452.128320 119168.64 8 6987.67 21805.00 15406.625
Textiles 15727.381452 5825.758681 975097.65 62 2818.59 25991.35 16785.190
Netherlands Furniture 201314.185185 156258.926896 10870966.00 54 68229.00 649199.00 155163.000
Home 168940.676667 89720.689814 1013644.06 6 66547.77 319304.00 155192.000
Textiles 117717.962750 68257.960203 4708718.51 40 65484.45 431584.10 101217.425

We may use the agg function to create different aggregations as per separate variables¶

In the below code, we are calculating mean of Revenue, and sum & maximum of Price for different Countries¶

In [59]:
data.groupby(["Country"]).agg({"Revenue":"mean","Price":["sum","max"]})
Out[59]:
Country Revenue Price
mean sum max
0 India 40694.3134 6899.98 529.0
1 Japan 15622.3848 5766.55 699.0
2 Netherlands 165933.2857 28428.14 899.0

We may use the unstack function to put the inner index (2nd column from the left of the previous output i.e. Category will be put into column area¶

In [21]:
data.groupby(["Country","Category"]).Revenue.mean().unstack()
Out[21]:
Category Furniture Home Lighting Textiles
Country
India 45829.000000 36411.488077 NaN 40760.823585
Japan 15987.214286 15259.449375 14896.08 15727.381452
Netherlands 201314.185185 168940.676667 NaN 117717.962750

We may also put the name of the column/columns in form of a list (by putting the different column names into square bracket) within unstack function to put it into the column area¶

In [36]:
data.groupby(["Country","Category"]).Revenue.mean().unstack("Country")
Out[36]:
Country India Japan Netherlands
Category
Furniture 45829.000000 15987.214286 201314.185185
Home 36411.488077 15259.449375 168940.676667
Lighting NaN 14896.080000 NaN
Textiles 40760.823585 15727.381452 117717.962750

Suppose we have put both Country and Category into the column area and save the result into a data frame¶

In [41]:
Analytics_Educator = data.groupby(["Country","Category","Department"]).Revenue.sum().unstack(["Country","Category"])
Analytics_Educator
Out[41]:
Country India Japan Netherlands
Category Furniture Home Textiles Furniture Home Lighting Textiles Furniture Home Textiles
Department
Bath NaN 129455.13 328633.10 NaN 38014.64 NaN 344259.95 NaN NaN 719079.49
Bedroom 82495.0 NaN 869674.70 72535.0 NaN NaN 230126.56 803198.0 126684.00 3647101.85
Dining 73210.0 NaN 272442.00 NaN 30796.50 NaN 163830.19 3960201.0 NaN 81455.25
Entryways 52500.0 234005.00 NaN NaN 40787.50 NaN NaN 553007.0 NaN NaN
Kids Room 223299.0 NaN 404392.09 47285.0 NaN 30813.25 86940.00 1327027.0 NaN 261081.92
Kitchen NaN 583238.56 285181.76 NaN 134552.55 NaN 149940.95 NaN 886960.06 NaN
Living Room 530905.0 NaN NaN 104001.0 NaN 88355.39 NaN 4227533.0 NaN NaN

Now we may use the stack function, which is just the opposite of unstack function, to push the variables from the column area back into row area¶

In [42]:
Analytics_Educator.stack(["Country","Category"])
Out[42]:
Department   Country      Category 
Bath         India        Home          129455.13
                          Textiles      328633.10
             Japan        Home           38014.64
                          Textiles      344259.95
             Netherlands  Textiles      719079.49
Bedroom      India        Furniture      82495.00
                          Textiles      869674.70
             Japan        Furniture      72535.00
                          Textiles      230126.56
             Netherlands  Furniture     803198.00
                          Home          126684.00
                          Textiles     3647101.85
Dining       India        Furniture      73210.00
                          Textiles      272442.00
             Japan        Home           30796.50
                          Textiles      163830.19
             Netherlands  Furniture    3960201.00
                          Textiles       81455.25
Entryways    India        Furniture      52500.00
                          Home          234005.00
             Japan        Home           40787.50
             Netherlands  Furniture     553007.00
Kids Room    India        Furniture     223299.00
                          Textiles      404392.09
             Japan        Furniture      47285.00
                          Lighting       30813.25
                          Textiles       86940.00
             Netherlands  Furniture    1327027.00
                          Textiles      261081.92
Kitchen      India        Home          583238.56
                          Textiles      285181.76
             Japan        Home          134552.55
                          Textiles      149940.95
             Netherlands  Home          886960.06
Living Room  India        Furniture     530905.00
             Japan        Furniture     104001.00
                          Lighting       88355.39
             Netherlands  Furniture    4227533.00
dtype: float64

Analytics Educator is the best institute for Data Science courses, based out of Kolkata. We provide training to all aspirants, who are willing to learn Data Science and get a job, across the globe. We specialize in providing training on data science even to students coming from non-technical background with zero programming or statistical knowledge. We help the associates to learn data science and get job in this field.¶

We provide a 100% money back guarantee on learning. It means that each and every student of analytics educator will be able to understand every line of codes and algorithm, else we will refund the money back.¶

You may check out all our instructor led courses from this link. Analytics Courses¶

If you want to learn real world business project on Machine Learning case studies then you may go through the following links -¶

Titanic Survival Case Study¶

Cancerl Detection using Machine Learning¶

T-Shirt Size Prediction¶

FREE Python visualization tutorial¶

Contact us for any feedback/ inquiry at Contact us¶

Copyright © 2017 Analytics Educator