Wednesday, January 18

Data Science Project - British Airways


Introduction:

This work is part of a virtual internship with British Airways, in which we analyze and interpret British Airways data. 
The goal of this project is to gain insights and identify patterns in the data that can potentially improve the overall performance and customer satisfaction of British Airways.

This project is composed of two main tasks:
The first task is focused on scraping data from a third-party website called SKYTRAX, which provides customer reviews and ratings of various airlines. We will use this data to gain insights into customer perceptions and satisfaction with British Airways. 

The second task is centered on building a predictive model using data provided by British Airways. The goal is to use this data to predict future customer behaviour based on multiple features. 

By completing these two tasks, we aim to gain a comprehensive understanding of British Airways' customers and identify areas for improvement.


Task 1: Data Scraping, Preprocessing and Analyzing


For the first task, we utilized the Beautiful Soup Python library to scrape more than 3,000 reviews from the SKYTRAX website. 
The data collected included information on customer ratings, reviews, and demographics. 
After collecting the data, we performed extensive cleaning and preprocessing to get it into a usable format for analysis. We then extracted valuable findings from the data, including customer sentiment toward specific aspects of British Airways' service, such as cabin comfort, entertainment, and Wi-Fi.
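The parsing step can be sketched as follows. This is a minimal, offline example: the HTML snippet and the `text_content` / `ratingValue` markup are assumptions about the review-page structure, not a guaranteed match for the live site.

```python
from bs4 import BeautifulSoup

# Offline stand-in for a fetched Skytrax review page; the class names
# below are assumptions about that site's markup, for illustration only.
html = """
<article>
  <div class="rating-10"><span itemprop="ratingValue">7</span></div>
  <div class="text_content">Friendly crew, comfortable seats.</div>
</article>
<article>
  <div class="rating-10"><span itemprop="ratingValue">3</span></div>
  <div class="text_content">Long delay and no Wi-Fi on board.</div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull the review text and the numeric rating out of each article.
reviews = [div.get_text(strip=True)
           for div in soup.find_all("div", class_="text_content")]
ratings = [int(span.get_text())
           for span in soup.select("span[itemprop='ratingValue']")]

print(reviews)   # two review strings
print(ratings)   # [7, 3]
```

In the real task, the same parsing would run inside a loop over paginated review URLs fetched with a library such as `requests`.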

These findings were represented in a slide deck, which was used to provide an overview of the results and identify key areas for improvement. 
The data collected from this task provided a comprehensive understanding of customer perceptions and satisfaction with British Airways and helped us to identify areas where the company could improve its service.





Task 2: Feature Engineering, Building ML Models and Evaluation


The second task involved working with a dataset of 50,000 rows provided by British Airways. 
The main challenges in this dataset were high-cardinality categorical features and an imbalanced class distribution. To address the categorical data, the team used simple encoding (such as one-hot encoding) for low-cardinality features and target encoding for high-cardinality features. 
This approach converts the categorical data into numerical values, which machine learning models can handle directly.
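The two encoding strategies can be illustrated with plain pandas. The column names below (`sales_channel`, `route`, `booking`) are hypothetical stand-ins, not the actual British Airways schema.

```python
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
df = pd.DataFrame({
    "sales_channel": ["Internet", "Mobile", "Internet", "Internet"],  # low cardinality
    "route":         ["LHR-JFK", "LHR-DEL", "LHR-JFK", "LGW-MAD"],    # high cardinality
    "booking":       [1, 0, 1, 0],                                    # target
})

# Low-cardinality feature: one-hot encoding, one column per category.
one_hot = pd.get_dummies(df["sales_channel"], prefix="channel")

# High-cardinality feature: target encoding, i.e. replace each category
# with the mean of the target over that category.
target_means = df.groupby("route")["booking"].mean()
df["route_encoded"] = df["route"].map(target_means)

print(df["route_encoded"].tolist())  # [1.0, 0.0, 1.0, 0.0]
```

Note that in practice the target means should be computed on the training split only (or with cross-fold smoothing), otherwise the encoding leaks the target into the features.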

To address the imbalanced dataset, the team used two techniques: SMOTE (Synthetic Minority Over-sampling Technique) and random under-sampling. These techniques were applied to the training set only, after splitting the data into training and test sets, so that the test set reflects the real class distribution. 

SMOTE generates synthetic data points for the minority class, in order to balance the class distribution. The synthetic samples are generated by interpolating between existing minority samples. 

Random under-sampling, on the other hand, removes a random subset of the majority-class samples so that the class distribution becomes balanced. In other words, SMOTE increases the number of minority-class samples, while under-sampling decreases the number of majority-class samples; either way, the balanced distribution helps the model learn the minority class better. 
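Both resampling ideas can be sketched in a few lines of NumPy. The interpolation below is a simplified version of SMOTE: real SMOTE (e.g. `imblearn.over_sampling.SMOTE`) interpolates toward k-nearest neighbours rather than random minority pairs, so treat this as an illustration of the mechanism, not the library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced 2-D data: 20 majority samples (class 0), 4 minority (class 1).
X_maj = rng.normal(0.0, 1.0, size=(20, 2))
X_min = rng.normal(3.0, 1.0, size=(4, 2))

def smote_like(X, n_new, rng):
    """Generate n_new synthetic points by interpolating between random
    pairs of existing minority samples (simplified SMOTE sketch)."""
    samples = []
    for _ in range(n_new):
        i, j = rng.choice(len(X), size=2, replace=False)
        lam = rng.random()                        # interpolation factor in [0, 1]
        samples.append(X[i] + lam * (X[j] - X[i]))
    return np.array(samples)

# Over-sample the minority class up to the majority count...
X_min_balanced = np.vstack([X_min, smote_like(X_min, 20 - 4, rng)])

# ...or randomly under-sample the majority class down to the minority count.
keep = rng.choice(len(X_maj), size=4, replace=False)
X_maj_balanced = X_maj[keep]

print(len(X_min_balanced), len(X_maj_balanced))  # 20 4
```

Each synthetic point lies on the line segment between two real minority samples, which is why SMOTE tends to produce plausible (rather than duplicated) minority examples.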

After balancing the dataset, we trained three different machine learning models: a Random Forest classifier, an XGBoost classifier, and a CatBoost classifier. 
The performance of these models was then evaluated using the F1 score and the AUC (Area Under the Curve) score as metrics. 
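The train-and-evaluate loop looks roughly like the following scikit-learn sketch. The real British Airways data is not public, so a synthetic imbalanced dataset stands in for it, and only the Random Forest is shown; XGBoost and CatBoost follow the same `fit`/`predict` pattern.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the real dataset (90% / 10% classes).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# F1 uses hard predictions; AUC uses the predicted probability of class 1.
f1 = f1_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"F1: {f1:.3f}  AUC: {auc:.3f}")
```

Reporting both metrics is deliberate: F1 depends on the classification threshold, while AUC summarises ranking quality across all thresholds.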

The F1 score is the harmonic mean of precision and recall, making it a better accuracy measure than raw accuracy for imbalanced classification problems. 

The AUC score measures a model's ability to distinguish between the positive and negative classes. It ranges from 0 to 1, where 1 indicates a perfect model, 0.5 is no better than random guessing, and 0 means the model's predictions are perfectly inverted.

After evaluating the models, it was found that the XGBoost classifier had the best performance. 
The XGBoost algorithm is an optimized version of the gradient boosting algorithm and is known for its high performance and ability to handle large datasets with multiple features.




Project Resources:

Tech used in this project: Python, Beautiful Soup, scikit-learn, Random Forest, XGBoost, CatBoost












