Analysis of a public Airbnb dataset from Kaggle: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
The dataset describes Airbnb listing activity and metrics in New York City for 2019. It includes the information needed to learn about hosts and geographic availability, along with the metrics required to make predictions and draw conclusions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline
nyc_df = pd.read_csv('AB_NYC_2019.csv')
nyc_df.head()
 | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
Based on the raw data provided, we can explore a number of questions.
In this stage the data is checked for accuracy and completeness prior to beginning the analysis.
nyc_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 48895 non-null int64
1 name 48879 non-null object
2 host_id 48895 non-null int64
3 host_name 48874 non-null object
4 neighbourhood_group 48895 non-null object
5 neighbourhood 48895 non-null object
6 latitude 48895 non-null float64
7 longitude 48895 non-null float64
8 room_type 48895 non-null object
9 price 48895 non-null int64
10 minimum_nights 48895 non-null int64
11 number_of_reviews 48895 non-null int64
12 last_review 38843 non-null object
13 reviews_per_month 38843 non-null float64
14 calculated_host_listings_count 48895 non-null int64
15 availability_365 48895 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
The name, host_name, last_review and reviews_per_month columns have missing values, which originate from a variety of reasons.
Additionally, it is recommended to make the above columns required fields when listing data is collected, to avoid missing data in the future.
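All of the per-column missing-value counts checked below can also be obtained in a single call. A minimal sketch on a toy frame (on the real data the same pattern is `nyc_df.isnull().sum()`):

```python
import pandas as pd

# Toy frame standing in for nyc_df; only the pattern matters here
toy = pd.DataFrame({
    "name": ["A", None, "C"],
    "last_review": ["2019-01-01", None, None],
    "price": [100, 200, 300],
})

# One call gives the missing-value count for every column at once
missing = toy.isnull().sum()
```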
##Rows with missing value for name
nyc_df['name'].isnull().value_counts()
False 48879
True 16
Name: name, dtype: int64
##Rows with missing value for host_name
nyc_df['host_name'].isnull().value_counts()
False 48874
True 21
Name: host_name, dtype: int64
##Rows with missing value for last_review
nyc_df['last_review'].isnull().value_counts()
False 38843
True 10052
Name: last_review, dtype: int64
##Rows with missing value for reviews_per_month
nyc_df['reviews_per_month'].isnull().value_counts()
False 38843
True 10052
Name: reviews_per_month, dtype: int64
The majority of the missing data is in the last_review and reviews_per_month columns. Since name and host_name have only a handful of missing values, we can simply drop the associated rows.
The reviews_per_month and last_review values are connected to each other. The missing values represent the fact that a review has not been left for that listing.
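This relationship can be sanity-checked directly. A minimal sketch on a toy frame with the same column names (on the full data, the same two boolean masks can be compared):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "number_of_reviews": [9, 0, 270],
    "reviews_per_month": [0.21, np.nan, 4.64],
})

# Rows missing reviews_per_month should be exactly the rows
# whose number_of_reviews is zero
no_monthly = toy["reviews_per_month"].isnull()
never_reviewed = toy["number_of_reviews"] == 0
consistent = bool((no_monthly == never_reviewed).all())
```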
#Replacing missing values
nyc_df['reviews_per_month'] = nyc_df['reviews_per_month'].fillna(0.0)
## We have replaced all the missing reviews_per_month with a 0.0
nyc_df['reviews_per_month'].isnull().value_counts()
False 48895
Name: reviews_per_month, dtype: int64
#Replacing missing last_review values with a sentinel end-of-year date
#Note: these rows will appear as December when a month feature is derived from last_review
nyc_df['last_review'] = nyc_df['last_review'].fillna("2019-12-31")
##Rows with missing value for last_review
nyc_df['last_review'].isnull().value_counts()
False 48895
Name: last_review, dtype: int64
#Deleting rows with missing name values
nyc_df.dropna(subset=['name'], inplace=True)
#Deleting rows with missing 'host_name' values
nyc_df.dropna(subset=['host_name'], inplace=True)
nyc_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48858 entries, 0 to 48894
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 48858 non-null int64
1 name 48858 non-null object
2 host_id 48858 non-null int64
3 host_name 48858 non-null object
4 neighbourhood_group 48858 non-null object
5 neighbourhood 48858 non-null object
6 latitude 48858 non-null float64
7 longitude 48858 non-null float64
8 room_type 48858 non-null object
9 price 48858 non-null int64
10 minimum_nights 48858 non-null int64
11 number_of_reviews 48858 non-null int64
12 last_review 48858 non-null object
13 reviews_per_month 48858 non-null float64
14 calculated_host_listings_count 48858 non-null int64
15 availability_365 48858 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 6.3+ MB
All missing values have been dealt with.
#Converting last_review to datetime values
nyc_df['last_review'] = pd.to_datetime(nyc_df['last_review'])
#Converting host_id to string
nyc_df['host_id'] = nyc_df['host_id'].astype(str)
In this stage we are adding new features that will provide more insight into the data.
#Identifying which month the review was left
nyc_df['month'] = nyc_df['last_review'].apply(lambda time: time.month)
#We need to convert the values in the month column from numbers to month names
dmap = {1:'January',2:'February',3:'March',4:'April',5:'May',6:'June',7:'July',8:'August',9:'September',10:'October',11:'November',12:'December'}
#Mapping our new dictionary to the Month column in the Dataframe
nyc_df['month'] = nyc_df['month'].map(dmap)
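As an aside, pandas can produce month names directly through the `.dt` accessor, which makes the dictionary unnecessary. A minimal sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"last_review": pd.to_datetime(["2019-05-21", "2018-10-19"])})
# .dt.month_name() maps each datetime straight to its month name
toy["month"] = toy["last_review"].dt.month_name()
```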
listing_coordinates is the latitude and longitude pair for each listing; it can be used to calculate distances to other points.
#Creating a dataframe for latitude and longitude of each listing
locations_df = nyc_df[["latitude","longitude"]]
#Converting the dataframe to an array of (latitude, longitude) records
locations_df = locations_df.to_records(index=False)
nyc_df['listing_coordinates'] = list(locations_df)
#There is now a column 'listing_coordinates' with the latitude and longitude pair for each listing
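An alternative sketch: `zip` produces the same pairs as plain Python tuples rather than numpy records (toy frame shown):

```python
import pandas as pd

toy = pd.DataFrame({"latitude": [40.64749, 40.75362],
                    "longitude": [-73.97237, -73.98377]})
# zip pairs each latitude with its longitude as a plain tuple
toy["listing_coordinates"] = list(zip(toy["latitude"], toy["longitude"]))
```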
The neighbourhood groups represent the boroughs of New York City. As people use Airbnb for short visits primarily it is useful to know how far listings are from popular locations in each borough like Central Park in Manhattan.
#Identifying unique neighbourhood groups (boroughs)
nyc_df['neighbourhood_group'].unique()
array(['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx'],
dtype=object)
Major Attraction in Each Neighbourhood Group (Borough)
#Coordinates of Major Attractions
brooklyn_bridge = (40.706001,-73.997002)
central_park = (40.769361,-73.977655)
citi_field = (40.75416365,-73.84082997)
staten_island_ferry = (40.643333,-74.074167)
bronx_zoo = (40.852905,-73.872971)
#Creating a dictionary with the values in the neighbourhood group column to the names of the related attraction
dmap_locations = {'Brooklyn':"Brooklyn Bridge", 'Manhattan':"Central Park", 'Queens':"Citi Field", 'Staten Island':"St. Georges Ferry Terminal", 'Bronx':"Bronx Zoo"}
#Creating a dictionary with the values in the neighbourhood group column to the coordinates of the related attraction
dmap_coordinates = {'Brooklyn':brooklyn_bridge, 'Manhattan':central_park, 'Queens':citi_field, 'Staten Island':staten_island_ferry, 'Bronx':bronx_zoo}
#Mapping our new dictionary to the neighbourhood_group column in the Dataframe
nyc_df['major_attraction_location'] = nyc_df['neighbourhood_group'].map(dmap_locations)
#Mapping our new dictionary to the neighbourhood_group column in the Dataframe
nyc_df['major_attraction_coordinates'] = nyc_df['neighbourhood_group'].map(dmap_coordinates)
from haversine import haversine, Unit
#haversine calculates the distance (in various units) between two points on Earth from their latitude and longitude
# We use a lambda to apply haversine row by row, calculating the distance from each listing to its borough's major attraction
# The default unit for haversine is kilometres, but we set it to miles here
# x.listing_coordinates and x.major_attraction_coordinates are the coordinates of the listing and the major attraction
nyc_df['distance_to_major_attractions'] = nyc_df.apply(lambda x: haversine(x.listing_coordinates,x.major_attraction_coordinates,unit=Unit.MILES), axis = 1)
#Rounding the values for readability
nyc_df['distance_to_major_attractions'] = nyc_df['distance_to_major_attractions'].round(2)
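For reference, the great-circle formula that the haversine library implements can be sketched directly. The coordinates below are Central Park and a Financial District listing from the dataset:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_miles(p1, p2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    R = 3958.8  # mean Earth radius in miles
    lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * R * asin(sqrt(a))

central_park = (40.769361, -73.977655)
listing = (40.70637, -74.00645)  # a Financial District listing
d = haversine_miles(central_park, listing)  # roughly 4.6 miles
```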
In this stage, we will examine the data to identify any patterns, trends and relationships between the variables. It will help us analyze the data and extract insights that can be used to make decisions.
Data Visualization will give us a clear idea of what the data means by giving it visual context.
sns.heatmap(nyc_df.corr())
<AxesSubplot:>
There does not appear to be any significant correlation between the variables.
nyc_df['host_id'].describe()
count 48858
unique 37425
top 219517861
freq 327
Name: host_id, dtype: object
There are 37425 unique hosts with host 219517861 having the most listings (327).
nyc_df[nyc_df['host_id'] == '219517861'].head()
 | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | ... | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | month | listing_coordinates | major_attraction_location | major_attraction_coordinates | distance_to_major_attractions
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
38293 | 30181691 | Sonder \| 180 Water \| Incredible 2BR + Rooftop | 219517861 | Sonder (NYC) | Manhattan | Financial District | 40.70637 | -74.00645 | Entire home/apt | 302 | ... | 0 | 2019-12-31 | 0.00 | 327 | 309 | December | [40.70637, -74.00645] | Central Park | (40.769361, -73.977655) | 4.61
38294 | 30181945 | Sonder \| 180 Water \| Premier 1BR + Rooftop | 219517861 | Sonder (NYC) | Manhattan | Financial District | 40.70771 | -74.00641 | Entire home/apt | 229 | ... | 1 | 2019-05-29 | 0.73 | 327 | 219 | May | [40.70771, -74.00641] | Central Park | (40.769361, -73.977655) | 4.52
38588 | 30347708 | Sonder \| 180 Water \| Charming 1BR + Rooftop | 219517861 | Sonder (NYC) | Manhattan | Financial District | 40.70743 | -74.00443 | Entire home/apt | 232 | ... | 1 | 2019-05-21 | 0.60 | 327 | 159 | May | [40.70743, -74.00443] | Central Park | (40.769361, -73.977655) | 4.50
39769 | 30937590 | Sonder \| The Nash \| Artsy 1BR + Rooftop | 219517861 | Sonder (NYC) | Manhattan | Murray Hill | 40.74792 | -73.97614 | Entire home/apt | 262 | ... | 8 | 2019-06-09 | 1.86 | 327 | 91 | June | [40.74792, -73.97614] | Central Park | (40.769361, -73.977655) | 1.48
39770 | 30937591 | Sonder \| The Nash \| Lovely Studio + Rooftop | 219517861 | Sonder (NYC) | Manhattan | Murray Hill | 40.74771 | -73.97528 | Entire home/apt | 255 | ... | 14 | 2019-06-10 | 2.59 | 327 | 81 | June | [40.74771, -73.97528] | Central Park | (40.769361, -73.977655) | 1.50
5 rows × 21 columns
The top host (219517861) is Sonder (NYC)
nyc_df['neighbourhood'].describe()
count 48858
unique 221
top Williamsburg
freq 3917
Name: neighbourhood, dtype: object
There are 221 neighbourhoods with Williamsburg having the most listings (3917).
nyc_df['neighbourhood'].value_counts().head(10)
Williamsburg 3917
Bedford-Stuyvesant 3713
Harlem 2655
Bushwick 2462
Upper West Side 1969
Hell's Kitchen 1954
East Village 1852
Upper East Side 1797
Crown Heights 1563
Midtown 1545
Name: neighbourhood, dtype: int64
#Calculating the total number of listings that the top 10 neighbourhoods account for
nyc_df['neighbourhood'].value_counts().head(10).sum()
23427
round((23427/48858)*100,2)
47.95
The top 10 neighbourhoods represent about 47.95% of all listings.
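Rather than hardcoding the two counts, the share can be computed in one expression. A minimal sketch on a toy series (the same pattern applies to nyc_df['neighbourhood']):

```python
import pandas as pd

toy = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 2)
# Share of the two most common values, as a percentage
top_share = toy.value_counts().head(2).sum() / len(toy) * 100
```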
#Summarizing the neighbourhood_group column
nyc_df['neighbourhood_group'].describe()
count 48858
unique 5
top Manhattan
freq 21643
Name: neighbourhood_group, dtype: object
There are 5 neighbourhood groups with Manhattan having the most listings (21643).
nyc_df['neighbourhood_group'].value_counts()
Manhattan 21643
Brooklyn 20089
Queens 5664
Bronx 1089
Staten Island 373
Name: neighbourhood_group, dtype: int64
#Identifying the number of rooms of each type
nyc_df['room_type'].value_counts()
Entire home/apt 25393
Private room 22306
Shared room 1159
Name: room_type, dtype: int64
sns.countplot(x='room_type',data=nyc_df,palette='viridis')
plt.title("Number of Rooms of Each Type",fontsize=20)
Text(0.5, 1.0, 'Number of Rooms of Each Type')
The majority of the listings are Entire home/apts or Private rooms.
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)
# We use order=nyc_df['month'].value_counts().index to sort the count plot by the value counts
sns.countplot(x='month',data=nyc_df,order = nyc_df['month'].value_counts().index,palette='viridis')
plt.title("Airbnb Listings Each Month",fontsize=20)
Text(0.5, 1.0, 'Airbnb Listings Each Month')
nyc_df['availability_365'].mean()
112.80142453641164
On average, any given listing is available 113 days in a year.
# Identifying the average availability for each neighbourhood group (rounded to 2 decimal places)
nbhd_group = nyc_df.groupby('neighbourhood_group')['availability_365'].mean().round(2)
#Converting the series nbhd_group to a dataframe
nbhd_group = nbhd_group.to_frame()
#Renaming columns
nbhd_group.rename(columns={'availability_365': 'average_availability'}, inplace=True)
# Identifying the average price for each neighbourhood group (rounded to the nearest dollar)
nbhd_group['average_price'] = nyc_df.groupby('neighbourhood_group')['price'].mean().round()
# Identifying the average number of reviews per listing for each neighbourhood group (rounded to the nearest whole number)
nbhd_group['average_number_of_reviews_per_listing'] = nyc_df.groupby('neighbourhood_group')['number_of_reviews'].mean().round()
# Identifying the total number of reviews for each neighbourhood group
nbhd_group['total_number_of_reviews'] = nyc_df.groupby('neighbourhood_group')['number_of_reviews'].sum()
nbhd_group.sort_values(by=['average_availability'])
 | average_availability | average_price | average_number_of_reviews_per_listing | total_number_of_reviews
---|---|---|---|---
neighbourhood_group | ||||
Brooklyn | 100.24 | 124.0 | 24.0 | 486174 |
Manhattan | 112.01 | 197.0 | 21.0 | 454126 |
Queens | 144.49 | 100.0 | 28.0 | 156902 |
Bronx | 165.70 | 87.0 | 26.0 | 28334 |
Staten Island | 199.68 | 115.0 | 31.0 | 11541 |
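The four columns above can also be built in a single named-aggregation call rather than one assignment per column. A minimal sketch on a toy frame with the same column names:

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Queens"],
    "availability_365": [100, 200, 150],
    "price": [120, 130, 100],
    "number_of_reviews": [10, 20, 5],
})

# One agg call produces all four summary columns at once
summary = toy.groupby("neighbourhood_group").agg(
    average_availability=("availability_365", "mean"),
    average_price=("price", "mean"),
    average_number_of_reviews_per_listing=("number_of_reviews", "mean"),
    total_number_of_reviews=("number_of_reviews", "sum"),
).round(2)
```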
On average,
nyc_df['minimum_nights'].mean()
7.012444226124688
On average, listings require a minimum stay of about 7 nights.
nyc_df.groupby('neighbourhood')['minimum_nights'].mean().sort_values()
neighbourhood
Breezy Point 1.000000
New Dorp 1.000000
Oakwood 1.200000
East Morrisania 1.400000
Woodlawn 1.454545
...
Bay Terrace, Staten Island 16.500000
Vinegar Hill 18.352941
Olinville 23.500000
North Riverdale 41.400000
Spuyten Duyvil 48.250000
Name: minimum_nights, Length: 221, dtype: float64
Listings in the Spuyten Duyvil neighbourhood have the highest average minimum-stay requirement, at approximately 48 nights.
nyc_df.groupby('neighbourhood_group')['minimum_nights'].mean()
neighbourhood_group
Bronx 4.564738
Brooklyn 6.057693
Manhattan 8.538188
Queens 5.182910
Staten Island 4.831099
Name: minimum_nights, dtype: float64
Listings in the Manhattan neighbourhood group have the highest average minimum-stay requirement, at approximately 8.5 nights.
nyc_df['distance_to_major_attractions'].mean()
3.0813486020713086
On average, any given listing is about 3.1 miles from its borough's major attraction.
# Identifying the average distance to the borough's major attraction for each neighbourhood group (rounded to 2 decimal places)
nbhd_group = nyc_df.groupby('neighbourhood_group')['distance_to_major_attractions'].mean().round(2)
#Converting the series nbhd_group to a dataframe
nbhd_group = nbhd_group.to_frame()
#Renaming columns
nbhd_group.rename(columns={'distance_to_major_attractions': 'average_distance_to_major_attractions'}, inplace=True)
# Identifying the average price for each neighbourhood group (rounded to the nearest dollar)
nbhd_group['average_price'] = nyc_df.groupby('neighbourhood_group')['price'].mean().round()
# Identifying the average number of reviews per listing for each neighbourhood group (rounded to the nearest whole number)
nbhd_group['average_number_of_reviews_per_listing'] = nyc_df.groupby('neighbourhood_group')['number_of_reviews'].mean().round()
# Identifying the total number of reviews for each neighbourhood group
nbhd_group['total_number_of_reviews'] = nyc_df.groupby('neighbourhood_group')['number_of_reviews'].sum()
nbhd_group.sort_values(by=['average_distance_to_major_attractions'])
 | average_distance_to_major_attractions | average_price | average_number_of_reviews_per_listing | total_number_of_reviews
---|---|---|---|---
neighbourhood_group | ||||
Bronx | 2.42 | 87.0 | 26.0 | 28334 |
Manhattan | 2.56 | 197.0 | 21.0 | 454126 |
Staten Island | 3.12 | 115.0 | 31.0 | 11541 |
Brooklyn | 3.37 | 124.0 | 24.0 | 486174 |
Queens | 4.19 | 100.0 | 28.0 | 156902 |
On average,
avg_all_listings = round(nyc_df['price'].mean(),2)
avg_all_listings
152.74
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(y='reviews_per_month',x='price', data=nyc_df)
plt.title("Reviews per Month vs. Price",fontsize=20)
Text(0.5, 1.0, 'Reviews per Month vs. Price')
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(y='number_of_reviews',x='price', data=nyc_df)
plt.title("Number of Reviews vs. Price",fontsize=20)
Text(0.5, 1.0, 'Number of Reviews vs. Price')
Based on the plot, we can see that the more expensive listings generally receive fewer reviews than the less expensive ones.
# Identifying the average listing price for each neighbourhood (rounded to 2 decimal places)
nbhd = nyc_df.groupby('neighbourhood')['price'].mean().round(2)
#Converting the series nbhd to a dataframe
nbhd = nbhd.to_frame()
#Renaming columns
nbhd.rename(columns={'price': 'average_price'}, inplace=True)
# Identifying the average number of reviews for each neighbourhood (rounded to the nearest whole number)
nbhd['average_number_of_reviews'] = nyc_df.groupby('neighbourhood')['number_of_reviews'].mean().round()
nbhd.head()
 | average_price | average_number_of_reviews
---|---|---
neighbourhood | ||
Allerton | 87.60 | 43.0 |
Arden Heights | 67.25 | 8.0 |
Arrochar | 115.00 | 15.0 |
Arverne | 171.78 | 29.0 |
Astoria | 117.19 | 21.0 |
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(y='average_number_of_reviews',x='average_price', data=nbhd)
plt.title("Number of Reviews vs. Price: Aggregated by Neighbourhood",fontsize=20)
Text(0.5, 1.0, 'Number of Reviews vs. Price: Aggregated by Neighbourhood')
Even after aggregating by neighbourhood averages, the less expensive listings still receive more reviews than the more expensive ones.
nbhd[nbhd['average_price']>avg_all_listings].count()
average_price 55
average_number_of_reviews 55
dtype: int64
There are 55 neighbourhoods with average listing price above the average for all listings.
nbhd[nbhd['average_price']<avg_all_listings].count()
average_price 166
average_number_of_reviews 166
dtype: int64
There are 166 neighbourhoods with average listing price below the average for all listings.
nyc_df.groupby('neighbourhood_group')['price'].std()
neighbourhood_group
Bronx 106.798933
Brooklyn 186.936694
Manhattan 291.489822
Queens 167.128794
Staten Island 277.620403
Name: price, dtype: float64
The largest standard deviation in price is in Manhattan.
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(y='price',x='neighbourhood_group', data=nyc_df)
plt.title("Listing Price by Neighbourhood Group",fontsize=20)
Text(0.5, 1.0, 'Listing Price by Neighbourhood Group')
The spread of prices is greatest in Manhattan.
# Identifying the average listing price for each neighbourhood group (rounded to 2 decimal places)
nbhd_group = nyc_df.groupby('neighbourhood_group')['price'].mean().round(2)
#Converting the series nbhd_group to a dataframe
nbhd_group = nbhd_group.to_frame()
#Renaming columns
nbhd_group.rename(columns={'price': 'average_price'}, inplace=True)
# Identifying the total number of reviews for each neighbourhood group
nbhd_group['total_number_of_reviews'] = nyc_df.groupby('neighbourhood_group')['number_of_reviews'].sum()
nbhd_group['number_of_listings'] = nyc_df['neighbourhood_group'].value_counts()
#Ratio of reviews as compared to total number of listings for each neighbourhood group
nbhd_group['ratio'] = (nbhd_group['total_number_of_reviews']/nbhd_group['number_of_listings']).round(2)
nbhd_group.head()
 | average_price | total_number_of_reviews | number_of_listings | ratio
---|---|---|---|---
neighbourhood_group | ||||
Bronx | 87.47 | 28334 | 1089 | 26.02 |
Brooklyn | 124.41 | 486174 | 20089 | 24.20 |
Manhattan | 196.90 | 454126 | 21643 | 20.98 |
Queens | 99.54 | 156902 | 5664 | 27.70 |
Staten Island | 114.81 | 11541 | 373 | 30.94 |
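Note that the ratio column (total reviews divided by listing count) is by definition the per-group mean of number_of_reviews. A minimal sketch on a toy frame confirms the equivalence:

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Queens"],
    "number_of_reviews": [10, 30, 5],
})

g = toy.groupby("neighbourhood_group")["number_of_reviews"]
# sum / count per group is exactly the group mean
ratio = g.sum() / g.count()
same = ratio.equals(g.mean())
```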
We notice something interesting in the data here:
Manhattan has the second largest number of listings but the lowest ratio of reviews to listings, which indicates that reviews are left less frequently for stays in the Manhattan neighbourhood group. There are several possible reasons for this.
room_type = nyc_df.groupby('room_type')['price'].mean().round(2)
#Converting the series room_type to a dataframe
room_type = room_type.to_frame()
#Renaming columns
room_type.rename(columns={'price': 'average_price'}, inplace=True)
# Identifying the total number of reviews for each room type
room_type['total_number_of_reviews'] = nyc_df.groupby('room_type')['number_of_reviews'].sum()
room_type
 | average_price | total_number_of_reviews
---|---|---
room_type | ||
Entire home/apt | 211.81 | 579856 |
Private room | 89.79 | 537965 |
Shared room | 70.08 | 19256 |
As expected, listings with Entire home/apt are the most expensive.
# Form a facetgrid using columns with a hue
graph = sns.FacetGrid(nyc_df, col ='room_type')
# map the above form facetgrid with some attributes
graph.map(sns.scatterplot, "price","number_of_reviews")
#Setting the title for the FacetGrid
graph.fig.subplots_adjust(top=0.8)
graph.fig.suptitle('Number of Reviews vs. Price for Each Room Type', fontsize=20)
Text(0.5, 0.98, 'Number of Reviews vs. Price for Each Room Type')
There are more reviews for less expensive listings regardless of room type.
The majority of the listings are Entire home/apts or Private rooms.
There are 5 neighbourhood groups with Manhattan having the most listings (21643).
Manhattan has the second largest number of listings but the lowest ratio of reviews to listings, which indicates that reviews are left less frequently for stays in the Manhattan neighbourhood group.
Additional Data Necessary
The data only tells us whether a review was left for any given listing. It would be beneficial to know what score each listing received when it was reviewed. As it stands, we can only rely on the number of reviews listings receive and assume that listings (and, by extension, neighbourhoods and neighbourhood groups) with more reviews are preferable.
#Exporting the dataset (without the index values)
#nyc_df.to_csv('Airbnb.csv', index=False)