IBM APPLIED DATA SCIENCE CAPSTONE PROJECT
RESTAURANTS IN NEW YORK’S BOROUGHS
INTRODUCTION
A couple of years ago I went to New York for language school, and I had the chance to live in and explore one of the most beautiful cities in the world. I cannot imagine a more appropriate subject for the IBM Capstone project. New York is one of the most crowded and cosmopolitan cities in the world: more than 8 million people live in its 5 boroughs and 306 neighborhoods. For this project I focused on the boroughs rather than the neighborhoods: “Bronx”, “Brooklyn”, “Manhattan”, “Queens”, and “Staten Island”. I want to see which types of restaurants are popular in these 5 boroughs.
In this project I used the Foursquare API and the NYU (New York University) Spatial Data Repository. After downloading and cleansing the data, I used the explore function of the Foursquare API, merged the results into one data frame, and applied the k-means clustering algorithm to complete the project. I also wanted visualization tools to express more clearly what I did and what I found, so I used the matplotlib and folium libraries.
DATA
I used the Foursquare API for venue details, such as the geolocation, name, and category of each venue.
I also took New York data from the New York University Spatial Data Repository (https://geo.nyu.edu/catalog/nyu_2451_34572), which contains borough and neighborhood data.
METHODOLOGY
First, I loaded the borough and neighborhood data from the New York University Spatial Data Repository (https://geo.nyu.edu/catalog/nyu_2451_34572); the venue details (geolocation, name, and category) come from the Foursquare API. Below you can see the first item in the NYU New York data.
After examining the first item, I decided which fields to use, then transformed the dictionary into a data frame by inserting the items one by one. Here are the first five rows of the data frame I created.
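A minimal sketch of this transformation, assuming the downloaded file is a GeoJSON-style dictionary with a "features" list whose entries carry the borough and neighborhood name under "properties" and a [longitude, latitude] pair under "geometry". The sample dictionary, its coordinate values, the column names, and the helper features_to_frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample imitating the assumed structure of the NYU file:
# a dict with a "features" list, one entry per neighborhood.
# Coordinate values are illustrative, not taken from the dataset.
newyork_data = {
    "features": [
        {
            "properties": {"borough": "Bronx", "name": "Wakefield"},
            "geometry": {"coordinates": [-73.8472, 40.8947]},  # [lon, lat]
        },
        {
            "properties": {"borough": "Manhattan", "name": "Chinatown"},
            "geometry": {"coordinates": [-73.9943, 40.7156]},
        },
    ]
}

COLUMNS = ["Borough", "Neighborhood", "Latitude", "Longitude"]


def features_to_frame(data):
    """Flatten the feature list into one row per neighborhood."""
    rows = [
        {
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": feature["geometry"]["coordinates"][1],
            "Longitude": feature["geometry"]["coordinates"][0],
        }
        for feature in data["features"]
    ]
    return pd.DataFrame(rows, columns=COLUMNS)


neighborhoods = features_to_frame(newyork_data)
print(neighborhoods.head())
```

Collecting the rows in a list and building the data frame once is usually simpler than appending one row at a time, and it gives the same table.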
Then I checked the data frame, confirmed that New York has 306 neighborhoods in 5 boroughs, and marked every neighborhood on the map.
Second, I connected to the Foursquare API to retrieve venue information (the name, latitude, longitude, and category of each venue) and merged it with my first data frame. The resulting data frame looks like this.
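The parsing half of this step could look roughly like the sketch below. It assumes the response of the legacy Foursquare explore call nests venues under response, groups, items; the sample payload, the column names, and the helper venues_from_payload are hypothetical, and the HTTP request itself (which needs credentials) is left out.

```python
import pandas as pd

# Hypothetical payload imitating the assumed shape of an "explore" response.
sample_payload = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Example Trattoria",
                               "location": {"lat": 40.8950, "lng": -73.8470},
                               "categories": [{"name": "Italian Restaurant"}]}},
                    {"venue": {"name": "Example Park",
                               "location": {"lat": 40.8940, "lng": -73.8480},
                               "categories": []}},
                ]
            }
        ]
    }
}


def venues_from_payload(payload, borough, neighborhood):
    """Turn one response into rows tagged with the borough and neighborhood."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        # Fall back to an empty category when the venue has none.
        categories = venue.get("categories") or [{}]
        rows.append({
            "Borough": borough,
            "Neighborhood": neighborhood,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            "Venue Category": categories[0].get("name"),  # None if uncategorized
        })
    return pd.DataFrame(rows)


venues = venues_from_payload(sample_payload, "Bronx", "Wakefield")
print(venues)
```

Calling such a helper once per neighborhood and concatenating the results with pd.concat would give a single venue table that already carries the borough and neighborhood columns.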
Then I examined the new data frame, which has 20613 rows and 7 columns. When importing the data I did not restrict the query to restaurants, so I imported every venue. I then checked how many restaurants, and which types of restaurants, exist in New York. This is one of the crucial parts of the project: I need a relatively high number of places for a better result, but I also had to reduce the size of the data frame and drop irrelevant data. After filtering, I found 4757 restaurants in 92 unique categories. I was not expecting that many categories, but the number of restaurants is very good.
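One way to do this filtering is sketched below. It assumes a venue counts as a restaurant when its category name contains the word "Restaurant"; the report does not state the exact rule, and the small venues table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical venue table; the real one has 20613 rows.
venues = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Manhattan", "Queens", "Queens"],
    "Venue": ["A", "B", "C", "D", "E"],
    "Venue Category": [
        "Italian Restaurant", "Park", "Chinese Restaurant",
        "Italian Restaurant", "Coffee Shop",
    ],
})

# Assumption: a restaurant is any venue whose category mentions "Restaurant".
is_restaurant = venues["Venue Category"].str.contains("Restaurant", na=False)
restaurants = venues[is_restaurant].reset_index(drop=True)

# Number of restaurants per category, most frequent first.
category_counts = restaurants["Venue Category"].value_counts()

print(len(restaurants), "restaurants in",
      restaurants["Venue Category"].nunique(), "categories")
print(category_counts)
```

The same category_counts series can be passed to a plotting call to produce a bar chart of restaurant types.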
Third, I wanted to see which types of restaurants are most common, so I created a bar chart, which shows this more clearly than a table.
Not surprisingly, Italian and Chinese restaurants are the most popular. The ranking was not shocking, but the numbers were: there are 522 Italian restaurants and 400 Chinese restaurants. As I mentioned in the introduction, New York is a very cosmopolitan city.
I then looked at the number of restaurants by borough.
After this step I applied one-hot encoding, computed the frequency of all 92 restaurant categories in each borough, and ranked the categories by frequency for each borough.
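The encoding and ranking can be sketched as follows, assuming a restaurant table with one row per venue. The sample rows, the column names, and the choice of the top two categories are hypothetical.

```python
import pandas as pd

# Hypothetical restaurant table with one row per venue.
restaurants = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Bronx", "Manhattan", "Manhattan"],
    "Venue Category": [
        "Spanish Restaurant", "Spanish Restaurant", "Italian Restaurant",
        "Italian Restaurant", "Japanese Restaurant",
    ],
})

# One-hot encode the category: one 0/1 column per restaurant type.
onehot = pd.get_dummies(restaurants["Venue Category"], dtype=float)
onehot.insert(0, "Borough", restaurants["Borough"])

# The mean of each 0/1 column is the share of that category in the borough.
freq = onehot.groupby("Borough").mean()

# Rank the categories by frequency and keep the top two per borough.
top = {
    borough: row.sort_values(ascending=False).index[:2].tolist()
    for borough, row in freq.iterrows()
}

print(freq)
print(top["Bronx"])
```

Each row of freq sums to 1, so boroughs with very different numbers of restaurants remain comparable when they are clustered.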
Fourth, I had to find the optimum number of clusters k for k-means, which is always data dependent. I used the Elbow method to see which number of clusters gives the best result.
As the chart shows, the elbow occurs at two: beyond that point, adding clusters reduces the error only slightly, which means two clusters is the optimum solution. That is why I applied two clusters.
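A minimal sketch of the Elbow method with scikit-learn is given below. The matrix X is a contrived stand-in for the borough frequency table (five rows, three categories) built so that the elbow falls at two; it is not the project data, and the parameter values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Contrived frequency matrix: 5 hypothetical boroughs x 3 categories.
X = np.array([
    [0.30, 0.10, 0.05],
    [0.28, 0.12, 0.06],
    [0.10, 0.30, 0.20],
    [0.12, 0.28, 0.22],
    [0.11, 0.31, 0.19],
])

# Fit k-means for each candidate k and record the within-cluster
# sum of squared distances (inertia).
inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 5))

# The elbow is the k after which the inertia stops dropping sharply.
# On this contrived data that is k = 2, so fit the final model with it.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

Plotting inertias against k with matplotlib gives the usual elbow chart. With only five boroughs, k can range from 1 to 4 at most before every borough becomes its own cluster.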
With two clusters applied on the map, Staten Island and the Bronx form one cluster, and Manhattan, Brooklyn, and Queens form the second.
RESULT
After data cleansing and aggregation, I used the k-means algorithm. To find how many clusters would be enough to reach my goal I used the Elbow method, which indicated that two clusters is the optimal choice.
In the first cluster we see three boroughs (“Brooklyn”, “Manhattan”, “Queens”), which cover the central and eastern parts of New York. The table shows that Japanese and Caribbean restaurants are more popular here than in the other cluster.
The second cluster contains only “Bronx” and “Staten Island”. Spanish restaurants and fast-food restaurants are more popular in this cluster (first in the Bronx, fourth in Staten Island).
In general, Italian restaurants are the most popular type of restaurant by far. This result was expected, because there are 522 Italian restaurants in New York.
DISCUSSION
In this project I tried to use everything I learned in the IBM Data Science program: data cleansing, data aggregation and analysis, visualization and, most importantly, machine learning algorithms. I also improved my Python knowledge.
The only thing I would add is more features, such as price and tips. With those features I could examine New York restaurants more deeply and perhaps try to build a suggestion phase.
CONCLUSION
Finally, I reached my goal, which was to answer the question “Which types of restaurants are popular in New York’s boroughs?”. In further analysis, the study could be extended to all neighborhoods rather than boroughs, and census data could be added to look for relationships between variables such as income or age and the types of restaurants. I hope this project helps you see the restaurants of New York’s boroughs in a different way.