Real Estate Price Prediction
— A Machine Learning and Data Science Project
Abstract
People looking for a new house tend to be conservative about their budget and their approach to the market. They try to find a home that meets their requirements and needs at a price they can afford. The ideal home is one that matches the customer's requirements within an appropriate budget, so price prediction is an important part of planning that budget. Most people approach a real estate manager or agent for a price estimate. This is helpful, but it also increases the expense because the agent must be paid a commission. A website that predicts the price instead saves both time and money, and it is easy for the customer to access.
Introduction
The problem is to build a model that predicts property price from features such as the number of bedrooms and bathrooms, the area in square feet, and the location. The task is then to build a website using HTML/CSS/JavaScript that shows the estimated price of a house.
In terms of project architecture, we take a housing data set from Kaggle.com and use it to build a machine learning model. While building the model we cover data science concepts such as data cleaning, feature engineering, dimensionality reduction, and outlier detection.
Once the model is built, we export it to a pickle file and write a Python Flask server that loads this pickle file and performs the price prediction.
This Python Flask server exposes HTTP endpoints for the various requests, and the UI written in HTML/CSS/JavaScript makes HTTP GET and POST calls to them.
In terms of tools and technology, we use Python as the programming language, Pandas for data cleaning, Matplotlib for data visualization, Sklearn (scikit-learn) for model building, Python Flask for the back-end server, and HTML/CSS/JavaScript for the website.
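As a rough sketch of the hand-off between the model and the server described above, the snippet below saves a fitted model with pickle and its column order as JSON, then reloads both the way a server might at startup. The toy training data, the file names price_model.pickle and columns.json, and the "data_columns" key are all hypothetical and chosen only for illustration.

```python
import json
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model; the real model would be trained on the Kaggle data.
columns = ["total_sqft", "bath", "bedrooms"]  # assumed feature names
X = np.array([[1000, 2, 2], [1500, 3, 3], [2000, 3, 4], [800, 1, 1]], dtype=float)
y = np.array([50.0, 80.0, 110.0, 35.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "price_model.pickle"  # hypothetical file name
    columns_path = Path(tmp) / "columns.json"      # hypothetical file name

    # Export step: done once, after training.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(columns_path, "w") as f:
        json.dump({"data_columns": columns}, f)

    # Load step: what a server could do when it starts.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)
    with open(columns_path) as f:
        restored_columns = json.load(f)["data_columns"]

# The restored model gives the same predictions as the original.
print(restored_columns)
print(np.allclose(restored.predict(X), model.predict(X)))
```

Saving the column order next to the model matters because the server has to build its input vector in exactly the order the model was trained on. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.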
Objective
This application helps the buyer invest in real estate without approaching an agent, which also decreases the risk of the investment. Buying property today is hectic and somewhat expensive, because the customer has to visit many places and also pay a commission to the agent or manager. Hence, we design a website that uses data science techniques to overcome these drawbacks of the current system. We are implementing the following in our website:
i) Location/area-based search
ii) Approx. cost of property considering the different attributes and factors.
Method of Implementation
The prediction project has three major steps:
i. Model Building
ii. Python Flask server
iii. Website building
Predicting real estate prices involves a stepwise procedure with detailed analysis. Before building the model, we need to apply various data science techniques to refine the data. The work is done with the help of machine learning algorithms. The following steps are executed in order to get the final result.
Data Cleaning: We downloaded the data set, a CSV file, from Kaggle.com, and installed the Anaconda Distribution, which provides the tools we need for our tasks. We then loaded the data set into Pandas and cleaned it using several techniques. The data cleaning process starts with handling the NA values: we drop the rows that contain them so that the data set is clean. It is important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers. Hence, we clean up our data using different methods and tools.
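The cleaning steps above might look roughly like the following sketch. It is not the project's actual code: the small DataFrame stands in for the Kaggle CSV, and the column names, the "2 BHK" size format, and the "1200 - 1400" range format are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle CSV; in practice the frame would come from
# pd.read_csv. Column names and value formats are assumptions.
raw = pd.DataFrame({
    "location": ["Whitefield", "Whitefield ", None, "Hebbal", "Hebbal"],
    "size": ["2 BHK", "3 BHK", "2 BHK", None, "4 Bedroom"],
    "total_sqft": ["1056", "1200 - 1400", "900", "1500", "34.46Sq. Meter"],
    "bath": [2.0, 3.0, 2.0, np.nan, 4.0],
    "price": [39.07, 62.0, 35.0, 70.0, 120.0],
})

def sqft_to_number(value):
    """Turn '1056' into 1056.0 and '1200 - 1400' into 1300.0; anything else into NaN."""
    parts = str(value).split("-")
    try:
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(value)
    except ValueError:
        return np.nan

def clean(df):
    df = df.dropna().copy()                      # drop rows with any NA value
    df["location"] = df["location"].str.strip()  # fix inconsistent formatting
    df["bedrooms"] = df["size"].str.split().str[0].astype(int)
    df["total_sqft"] = df["total_sqft"].apply(sqft_to_number)
    df = df.dropna(subset=["total_sqft"])        # drop malformed area records
    return df.drop(columns="size").reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Of the five toy rows, two are dropped for missing values and one for an area that cannot be parsed, which leaves two clean rows.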
Feature Engineering and Dimensionality Reduction Techniques: Feature engineering uses domain knowledge of the data to create additional features that help the machine learning algorithm work and keep the model simpler. Done correctly, it increases the predictive power of the algorithm by creating features from the raw data that make the learning task easier.
Dimensionality reduction is a technique for reducing the number of input variables in a data set. A large number of inputs can degrade the performance of the algorithm and lead to a poor model. Hence, adding certain features and removing certain inputs helps the model to be built more efficiently.
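To make the two ideas concrete, here is a minimal sketch on toy data. It assumes a price-per-square-foot feature and a rule that merges rarely seen locations into a single "other" category. Both are illustrative choices, not necessarily what the project used, and the column names and the MIN_COUNT threshold are hypothetical.

```python
import pandas as pd

# Toy cleaned data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "location": ["Whitefield", "Whitefield", "Whitefield",
                 "Hebbal", "Hebbal", "Yelahanka", "Domlur"],
    "total_sqft": [1000.0, 1200.0, 1500.0, 900.0, 1100.0, 800.0, 950.0],
    "price": [50.0, 66.0, 90.0, 54.0, 60.5, 32.0, 76.0],
})

# Feature engineering: derive a new feature from two raw columns.
df["price_per_sqft"] = df["price"] / df["total_sqft"]

# Dimensionality reduction: locations seen fewer than MIN_COUNT times are
# merged into "other", so a later one-hot encoding produces fewer columns.
MIN_COUNT = 2  # hypothetical threshold
counts = df["location"].value_counts()
rare = counts[counts < MIN_COUNT].index
df["location"] = df["location"].where(~df["location"].isin(rare), "other")

print(df)
print(df["location"].nunique(), "distinct locations after merging")
```

Here four distinct locations become three, because the two locations that appear only once are merged. On a real data set with hundreds of sparsely populated locations, the same rule would remove many input columns.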
Outlier Detection and Outlier Removal: Outliers are data points that are either data errors or, when they are not errors, extreme variations in the data set. Even when they are valid, it makes sense to remove them, because otherwise they can cause issues later on. We detect and remove outliers using the standard deviation technique, which discards points that lie too far from the mean, measured in standard deviations.
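A standard-deviation filter of this kind could be sketched as below. The text does not say which column is filtered or how wide the band is, so the price_per_sqft column, the per-location grouping, and the one-standard-deviation cutoff are assumptions.

```python
import pandas as pd

def remove_outliers(df, column="price_per_sqft", group="location", n_std=1.0):
    """Keep rows whose `column` lies within n_std standard deviations of its group mean.

    Rows in a group with a single member are dropped, because the sample
    standard deviation is undefined (NaN) for them.
    """
    mean = df.groupby(group)[column].transform("mean")
    std = df.groupby(group)[column].transform("std")
    keep = (df[column] - mean).abs() <= n_std * std
    return df[keep].reset_index(drop=True)

# Toy data: each location has four typical values and one extreme value.
df = pd.DataFrame({
    "location": ["A"] * 5 + ["B"] * 5,
    "price_per_sqft": [50, 52, 48, 51, 150, 80, 82, 79, 81, 10],
})

filtered = remove_outliers(df)
print(filtered)
```

Computing the mean and standard deviation per location rather than over the whole data set keeps an expensive neighbourhood from being treated as an outlier relative to a cheap one.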
Model Building: We build a machine learning model, using k-fold cross-validation and GridSearchCV to find the best algorithm and the best parameters. Along with this, we use the Pandas get_dummies method to encode categorical values as dummy variables. While building the model we divide the data set into training and testing data to evaluate model performance. We imported the linear regression, lasso, and decision tree algorithms, and the best of these is selected to build the model.
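The selection procedure described above can be sketched as follows. The data is synthetic, and the parameter grids, the five folds, and the 80/20 split are assumptions. GridSearchCV scores regressors by R² by default.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cleaned data set (all names are assumptions).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "location": rng.choice(["Whitefield", "Hebbal", "other"], size=n),
    "total_sqft": rng.uniform(500, 3000, size=n),
    "bath": rng.integers(1, 5, size=n),
    "bedrooms": rng.integers(1, 6, size=n),
})
premium = df["location"].map({"Whitefield": 20.0, "Hebbal": 10.0, "other": 0.0})
df["price"] = (0.05 * df["total_sqft"] + 5 * df["bedrooms"]
               + premium + rng.normal(0, 2, size=n))

# One-hot encode the location; one dummy column is dropped as redundant.
X = pd.get_dummies(df.drop(columns="price"), columns=["location"], dtype=float)
X = X.drop(columns="location_other")
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

# Candidate algorithms with small, illustrative parameter grids.
candidates = {
    "linear_regression": (LinearRegression(), {"fit_intercept": [True, False]}),
    "lasso": (Lasso(max_iter=10000), {"alpha": [0.1, 1.0]}),
    "decision_tree": (DecisionTreeRegressor(random_state=0),
                      {"max_depth": [3, 6, None]}),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
rows = []
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv)  # k-fold CV per parameter set
    search.fit(X_train, y_train)
    rows.append({
        "model": name,
        "cv_r2": search.best_score_,
        "best_params": search.best_params_,
        "test_r2": search.score(X_test, y_test),   # held-out check
    })

results = pd.DataFrame(rows).sort_values("cv_r2", ascending=False)
print(results)
```

The top row of the resulting table is the candidate that would be carried forward. The held-out test score is reported separately so that the comparison between algorithms relies only on the training folds.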
Python Flask server: The exported pickle file and JSON file are used to build the Python Flask server. This Flask server is the backend for the UI application. It is connected to the website and acts as the data source for the project.
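A backend along these lines might look like the sketch below. The route names, the form field names, and the response keys are hypothetical. To keep the example self-contained, a toy model trained in place stands in for the model that the real server would load from the pickle file, and DATA_COLUMNS stands in for the contents of the JSON file.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LinearRegression

# Stand-ins for the exported artifacts (assumptions for illustration only).
DATA_COLUMNS = ["total_sqft", "bath", "bedrooms", "hebbal", "whitefield"]
LOCATIONS = DATA_COLUMNS[3:]  # the one-hot location columns
_X = np.array([
    [1000, 2, 2, 1, 0],
    [1500, 3, 3, 1, 0],
    [1200, 2, 2, 0, 1],
    [2000, 3, 4, 0, 1],
    [900, 1, 2, 0, 0],
    [1600, 2, 3, 0, 0],
], dtype=float)
_y = np.array([60.0, 90.0, 80.0, 130.0, 40.0, 75.0])
MODEL = LinearRegression().fit(_X, _y)

app = Flask(__name__)

def estimate_price(location, sqft, bath, bedrooms):
    """Build the feature vector in training-column order and predict."""
    x = np.zeros(len(DATA_COLUMNS))
    x[0], x[1], x[2] = sqft, bath, bedrooms
    if location.lower() in LOCATIONS:
        x[DATA_COLUMNS.index(location.lower())] = 1  # one-hot location
    return round(float(MODEL.predict([x])[0]), 2)

@app.route("/get_location_names", methods=["GET"])
def get_location_names():
    # Lets the UI fill its location drop-down.
    return jsonify({"locations": LOCATIONS})

@app.route("/predict_home_price", methods=["POST"])
def predict_home_price():
    form = request.form
    try:
        price = estimate_price(form["location"], float(form["total_sqft"]),
                               int(form["bath"]), int(form["bedrooms"]))
    except (KeyError, ValueError):
        # Missing or non-numeric fields end up here.
        return jsonify({"error": "expected location, total_sqft, bath, bedrooms"}), 400
    return jsonify({"estimated_price": price})
```

Saved as a module, the app can be started with Flask's development server (for example `flask --app <module name> run`). The website would then call the GET route to list locations and the POST route to request a prediction.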
Building the website: The website makes HTTP calls to the Python Flask server and shows the predicted price to the user. The user needs to enter the predefined parameters in order to get the predicted price.
Conclusion
In today’s real estate world it is hard to store large amounts of data and extract from it what is useful for a particular need. This system uses the linear regression algorithm for its model, so that the data is used efficiently. The model helps to meet users' needs by supporting an accurate choice of property and reducing the risk of investment. In future, adding data for other cities would help customers explore more properties and reach a better-informed decision. Factors that affect house prices, such as recession, should also be added, along with more in-depth details of the properties to give the user more information. Using larger data sets should further increase the accuracy of the model. Different models could also be tried so that calculation time decreases and the whole process can be carried out with ease.