The source of the file is IEEE or Institute of Electrical and Electronics Engineers (https://ieee-dataport.org/open-access/boardgamegeek-dataset-board-games). According to their website, "IEEE is the world’s largest technical professional organization dedicated to advancing technology for the benefit of humanity." The dataset contains data collected on board games from the BoardGameGeek (BGG) website in Februay 2021. boardgames.info()The voluntary online community shares reviews and ratings on the website, and the data set contains all ranked games as of February 2021.

In [2]:
import pandas as pd
boardgames = pd.read_excel('/Users/sheilaacar/Documents/BGG_Data_Set.xlsx')
In [3]:
boardgames.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20343 entries, 0 to 20342
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  20327 non-null  float64
 1   Name                20343 non-null  object 
 2   Year Published      20342 non-null  float64
 3   Min Players         20343 non-null  int64  
 4   Max Players         20343 non-null  int64  
 5   Play Time           20343 non-null  int64  
 6   Min Age             20343 non-null  int64  
 7   Users Rated         20343 non-null  int64  
 8   Rating Average      20343 non-null  float64
 9   BGG Rank            20343 non-null  int64  
 10  Complexity Average  20343 non-null  float64
 11  Owned Users         20320 non-null  float64
 12  Mechanics           18745 non-null  object 
 13  Domains             10184 non-null  object 
dtypes: float64(5), int64(6), object(3)
memory usage: 2.2+ MB
In [9]:
boardgames.head()
Out[9]:
ID Name Year Published Min Players Max Players Play Time Min Age Users Rated Rating Average BGG Rank Complexity Average Owned Users Mechanics Domains
0 174430.0 Gloomhaven 2017.0 1 4 120 14 42055 8.79244 1 3.8604 68323.0 Action Queue, Action Retrieval, Campaign / Bat... Strategy Games, Thematic Games
1 161936.0 Pandemic Legacy: Season 1 2015.0 2 4 60 13 41643 8.61278 2 2.8405 65294.0 Action Points, Cooperative Game, Hand Manageme... Strategy Games, Thematic Games
2 224517.0 Brass: Birmingham 2018.0 2 4 120 14 19217 8.66337 3 3.9129 28785.0 Hand Management, Income, Loans, Market, Networ... Strategy Games
3 167791.0 Terraforming Mars 2016.0 1 5 120 12 64864 8.43254 4 3.2406 87099.0 Card Drafting, Drafting, End Game Bonuses, Han... Strategy Games
4 233078.0 Twilight Imperium: Fourth Edition 2017.0 3 6 480 14 13468 8.69649 5 4.2219 16831.0 Action Drafting, Area Majority / Influence, Ar... Strategy Games, Thematic Games

I decided to drop the columns that would not be useful for the analysis such as ID and Users Rated.

In [12]:
boardgames = boardgames.drop(columns=['ID','Users Rated'])

I dropped outliers with few owned users. The 25th percentile for owned users is 146. I decided that these games are unpopular, compared to the mean of 1408 owned users. As a result, I decided to drop the games with less than 146 users.

In [15]:
boardgames.drop(boardgames[(boardgames['Owned Users'] < 146)].index, inplace=True)

I noticed from the descriptive statistics that the max for play time is 6000 minutes. I wanted to inspect this further by creating a boxplot of the data. Based on the boxplot, I decided to drop the rows with play time above 10000.

In [18]:
boardgames.boxplot(grid=False, rot=45, column = ['Play Time'])
Out[18]:
<Axes: >
No description has been provided for this image
In [20]:
boardgames.drop(boardgames[(boardgames['Play Time'] > 10000)].index, inplace=True)

I dropped missing values from the dataframe and dropped duplicates.

In [24]:
boardgames = boardgames.dropna(inplace = False)
boardgames.info()
boardgames.shape
<class 'pandas.core.frame.DataFrame'>
Index: 8623 entries, 0 to 20342
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                8623 non-null   object 
 1   Year Published      8623 non-null   float64
 2   Min Players         8623 non-null   int64  
 3   Max Players         8623 non-null   int64  
 4   Play Time           8623 non-null   int64  
 5   Min Age             8623 non-null   int64  
 6   Rating Average      8623 non-null   float64
 7   BGG Rank            8623 non-null   int64  
 8   Complexity Average  8623 non-null   float64
 9   Owned Users         8623 non-null   float64
 10  Mechanics           8623 non-null   object 
 11  Domains             8623 non-null   object 
dtypes: float64(4), int64(5), object(3)
memory usage: 875.8+ KB
Out[24]:
(8623, 12)
In [26]:
boardgames.drop_duplicates()
Out[26]:
Name Year Published Min Players Max Players Play Time Min Age Rating Average BGG Rank Complexity Average Owned Users Mechanics Domains
0 Gloomhaven 2017.0 1 4 120 14 8.79244 1 3.8604 68323.0 Action Queue, Action Retrieval, Campaign / Bat... Strategy Games, Thematic Games
1 Pandemic Legacy: Season 1 2015.0 2 4 60 13 8.61278 2 2.8405 65294.0 Action Points, Cooperative Game, Hand Manageme... Strategy Games, Thematic Games
2 Brass: Birmingham 2018.0 2 4 120 14 8.66337 3 3.9129 28785.0 Hand Management, Income, Loans, Market, Networ... Strategy Games
3 Terraforming Mars 2016.0 1 5 120 12 8.43254 4 3.2406 87099.0 Card Drafting, Drafting, End Game Bonuses, Han... Strategy Games
4 Twilight Imperium: Fourth Edition 2017.0 3 6 480 14 8.69649 5 4.2219 16831.0 Action Drafting, Area Majority / Influence, Ar... Strategy Games, Thematic Games
... ... ... ... ... ... ... ... ... ... ... ... ...
20337 Trouble 1965.0 2 4 45 4 3.79191 20339 1.0541 4962.0 Roll / Spin and Move Children's Games
20339 Bingo 1530.0 2 99 60 5 2.85331 20341 1.0455 1533.0 Betting and Bluffing, Bingo, Pattern Recognition Party Games
20340 Candy Land 1949.0 2 4 30 3 3.17792 20342 1.0779 5788.0 Roll / Spin and Move Children's Games
20341 Chutes and Ladders -200.0 2 6 30 3 2.85567 20343 1.0201 4400.0 Dice Rolling, Grid Movement, Race, Roll / Spin... Children's Games
20342 Tic-Tac-Toe -1300.0 2 2 1 4 2.67544 20344 1.1580 1374.0 Paper-and-Pencil, Pattern Building Abstract Games, Children's Games

8623 rows × 12 columns

In [32]:
boardgames.to_csv('/Users/sheilaacar/Documents/boardgames_clean.csv', header=True, index=False)