Wednesday, January 18

Data Science Project - British Airways


Introduction:

This work is part of a virtual internship with British Airways, in which we analyze and interpret British Airways data.
The goal of this project is to gain insights and identify patterns in the data that can potentially improve the overall performance and customer satisfaction of British Airways.

This project is composed of two main tasks:
The first task focuses on scraping data from SKYTRAX, a third-party website that provides customer reviews and ratings of various airlines. We will use this data to gain insight into customer perceptions of, and satisfaction with, British Airways.

The second task centers on building a predictive model from data provided by British Airways, with the goal of predicting future customer behavior from multiple features.

By completing these two tasks, we aim to gain a comprehensive understanding of British Airways' customers and identify areas for improvement.


Task 1: Data Scraping, Preprocessing and Analyzing


For the first task, we used the Beautiful Soup Python library to scrape more than 3,000 reviews from the SKYTRAX website. 
The data collected included customer ratings, reviews, and demographics. 
After collecting the data, we performed extensive cleaning and preprocessing to get it into a usable format for analysis. We then extracted valuable findings, including customer sentiment toward specific aspects of British Airways' service, such as cabin comfort, in-flight entertainment, and Wi-Fi.
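A minimal sketch of the scraping step is shown below. The Skytrax URL pattern and the `text_content` CSS class are assumptions about the site's public review pages, so the real markup may differ; parsing is kept separate from fetching so it can be tested on static HTML.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"

def parse_reviews(html):
    """Extract review bodies from one review-list page."""
    soup = BeautifulSoup(html, "html.parser")
    # Each review body is assumed to sit in a div with class "text_content"
    return [div.get_text(strip=True)
            for div in soup.find_all("div", class_="text_content")]

def scrape_reviews(pages=30, page_size=100):
    """Fetch several pages of reviews and return them as a flat list."""
    reviews = []
    for page in range(1, pages + 1):
        url = (f"{BASE_URL}/page/{page}/"
               f"?sortby=post_date%3ADesc&pagesize={page_size}")
        reviews.extend(parse_reviews(requests.get(url).text))
    return reviews
```

With 30 pages of 100 reviews each, a loop like this yields the 3,000+ reviews mentioned above.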

These findings were represented in a slide deck, which was used to provide an overview of the results and identify key areas for improvement. 
The data collected from this task provided a comprehensive understanding of customer perceptions and satisfaction with British Airways and helped us to identify areas where the company could improve its service.





Task 2: Feature Engineering, Building ML Models and Evaluation


The second task involved working with a dataset of 50,000 rows provided by British Airways. 
The main challenges in this dataset were high-cardinality categorical features and class imbalance. To handle the categorical data, the team used one-hot encoding for low-cardinality categorical features and target encoding for high-cardinality ones. 
This converted the categorical data into numerical values that machine learning models can handle easily.
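The two encoding strategies can be sketched with pandas as follows. The column names here are illustrative, not the actual British Airways schema; note also that in practice target encoding should be fitted on the training data only, to avoid leaking the target into the features.

```python
import pandas as pd

df = pd.DataFrame({
    "sales_channel": ["Internet", "Mobile", "Internet", "Mobile"],  # low cardinality
    "route": ["AKLDEL", "AKLDEL", "PERPNH", "DMKICN"],              # high cardinality
    "booking_complete": [1, 0, 1, 0],                               # target
})

# Low-cardinality feature: one-hot encoding, one column per category
one_hot = pd.get_dummies(df["sales_channel"], prefix="channel")

# High-cardinality feature: target encoding, i.e. replace each category
# with the mean of the target within that category
target_means = df.groupby("route")["booking_complete"].mean()
df["route_encoded"] = df["route"].map(target_means)
```

One-hot encoding a column with hundreds of levels (such as a route code) would explode the feature count, which is why target encoding is the usual choice for high-cardinality features.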

To address the unbalanced dataset, the team used two techniques: SMOTE (Synthetic Minority Over-sampling Technique) and random under-sampling. These techniques were applied to the train set only after splitting the data into train and test sets. 

SMOTE generates synthetic data points for the minority class, in order to balance the class distribution. The synthetic samples are generated by interpolating between existing minority samples. 

Random under-sampling, on the other hand, randomly removes samples from the majority class until the class distribution is balanced. SMOTE increases the number of minority-class samples while random under-sampling decreases the number of majority-class samples; used together, they give the model a more balanced distribution to learn from.
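The core of the SMOTE idea can be sketched in plain NumPy: a synthetic point is an interpolation between a minority sample and one of its neighbours. (In practice a library implementation such as imbalanced-learn's `SMOTE` is used, fitted on the training set only.)

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(x, neighbor):
    """Return one synthetic sample on the segment between x and a neighbor."""
    gap = rng.random()                  # uniform in [0, 1)
    return x + gap * (neighbor - x)     # point along the connecting line

minority = np.array([[1.0, 2.0],
                     [2.0, 3.0]])
synthetic = smote_sample(minority[0], minority[1])
# synthetic lies between the two original points, feature-wise
```

Because the new points sit between real minority samples, SMOTE enlarges the minority class without simply duplicating rows.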

After balancing the dataset, we trained three different machine learning models: a Random Forest classifier, an XGBoost classifier, and a CatBoost classifier.
The performance of these models was then evaluated using the F1 score and the AUC (Area Under the Curve) score as metrics. 

The F1 score is the harmonic mean of precision and recall. It is commonly used in imbalanced classification problems, where plain accuracy can be misleading. 

The AUC score measures a model's ability to distinguish between the positive and negative classes. It ranges from 0 to 1: a perfect model scores 1, a model no better than random guessing scores 0.5, and scores below 0.5 indicate predictions worse than chance.
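The evaluation loop can be sketched with scikit-learn, shown here for the Random Forest model; the same F1/AUC computation applied to the XGBoost and CatBoost models. Since the British Airways dataset is private, a synthetic imbalanced dataset stands in for it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (private) British Airways dataset,
# with roughly 85% of samples in the majority class
X, y = make_classification(n_samples=2000, weights=[0.85], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# F1 uses hard predictions; AUC uses the predicted probability of class 1
f1 = f1_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

Swapping in `XGBClassifier` or `CatBoostClassifier` for the model leaves the rest of the loop unchanged, which makes the three models directly comparable.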

After evaluating the models, it was found that the XGBoost classifier had the best performance. 
The XGBoost algorithm is an optimized version of the gradient boosting algorithm and is known for its high performance and ability to handle large datasets with multiple features.




Project Resources:

Tech used in this project: Python, Beautiful Soup, Scikit-learn, Random Forest, XGBoost, CatBoost














Saturday, January 14

Audio Transcription Web Application using Flask

In today's world, audio files are widely used in various fields such as podcasting, voice notes, and more. However, manually transcribing these audio files can be a tedious and time-consuming task. To simplify this process, we have developed a basic audio transcription web application using the Flask framework.


How it works

The application allows users to upload an audio file and returns its transcribed text. The transcription is performed with the SpeechRecognition library, a Python library that provides a common interface for speech recognition. It supports several engines and APIs, including the Google Web Speech API, the Google Cloud Speech API, and more.

The application also uses the Flask framework for the web interface. Flask is a lightweight Python web framework that enables us to develop web applications easily. It provides a simple and easy-to-use API for handling requests and responses.
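The app's two moving parts can be sketched as a Flask upload route plus a SpeechRecognition call. The route names and upload form here are illustrative, not necessarily those of the actual app; see the GitHub repo for the real implementation.

```python
import os
import tempfile

from flask import Flask, request

app = Flask(__name__)

UPLOAD_FORM = """
<form action="/transcribe" method="post" enctype="multipart/form-data">
  <input type="file" name="audio">
  <button type="submit">Transcribe</button>
</form>
"""

@app.route("/")
def index():
    # Serve the upload form
    return UPLOAD_FORM

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # Save the uploaded file, then hand it to SpeechRecognition, which
    # records the whole file and sends it to the Google Web Speech API.
    import speech_recognition as sr
    uploaded = request.files["audio"]
    path = os.path.join(tempfile.gettempdir(), "upload.wav")
    uploaded.save(path)
    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)
```

Flask's `request.files` mapping gives direct access to the multipart upload, which is what keeps the whole app this small.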


User Interface

The application has a simple user interface that allows users to upload an audio file and view the transcribed text. The user selects the audio file by clicking the "Choose File" button, then clicks the "Transcribe" button; the transcribed text is displayed on the page.


Conclusion

In conclusion, this audio transcription web application using Flask can be a valuable tool for anyone who needs to transcribe audio files quickly and easily. It is a basic application, but it can be further enhanced with additional features and functionality. The use of the Flask framework and the SpeechRecognition library makes it easy to develop and maintain. This application can be a time-saver for podcasters, journalists, students, and anyone who needs to transcribe audio files regularly.


Project resources

Tech used in this project: Python, Flask, CSS, HTML
GitHub project link: https://github.com/BoulahiaAhmed/Audio-Transcription-Webapp-using-Flask




Sunday, January 8

Identifying Credit Card Fraud through Machine Learning Techniques



Introduction

Credit card fraud is a pervasive problem that affects both consumers and financial institutions. With the increase in online transactions, the risk of fraud has also increased. In this project, we aimed to develop a model to detect credit card fraud using machine learning techniques.


Dataset Description

The dataset used in this project consisted of 284,807 transactions, of which 492 were identified as fraudulent.
This represents a fraud rate of 0.172%. The data was highly unbalanced, with a large majority of transactions being non-fraudulent.



In the dataset's class column, the value 1 denotes a fraudulent transaction and the value 0 a non-fraudulent one.


Methods and Algorithms

To overcome the issue of unbalanced data, we implemented two techniques: oversampling with the Synthetic Minority Oversampling Technique (SMOTE) and under-sampling with the Random Under-Sampling technique (RUS).
These techniques helped to balance the data and improve the performance of our machine learning models.



SMOTE creates synthetic examples for the minority class. The algorithm picks minority-class examples that are close to one another in feature space, draws a line connecting them, and creates a new sample at a point along that line.


Random Under-Sampling Technique



RUS randomly selects examples from the majority class and deletes them from the training dataset, discarding majority-class instances at random until a more balanced distribution is reached.
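Random under-sampling is simple enough to sketch in plain NumPy: keep all fraud rows and a same-sized random subset of the genuine rows. (A library implementation such as imbalanced-learn's `RandomUnderSampler` does the same job.)

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X, y):
    """Return a balanced subset: all minority rows, equal-sized majority sample."""
    minority_idx = np.flatnonzero(y == 1)   # fraudulent rows
    majority_idx = np.flatnonzero(y == 0)   # genuine rows
    keep = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    idx = np.concatenate([minority_idx, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Because whole rows are discarded, RUS loses information from the majority class, which is why it is often paired with SMOTE rather than used alone.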

Using these two techniques, we balanced the training data, transforming the original training set of 200,405 genuine transactions and 354 fraudulent transactions into a balanced set with 198,276 examples of each class.

This is an important step in training a model for a classification task, as it helps to prevent the model from having a bias towards one class.

Machine Learning Models

In our study, we implemented two machine learning models: a Random Forest classifier and an XGBoost classifier. Our analysis revealed that the Random Forest model achieved higher precision and F1 score than the XGBoost classifier. While the XGBoost classifier had a higher recall score, the Random Forest model predicted the true labels of the data more precisely, as reflected in its higher precision and F1 score.

Evaluating the results

The first screenshot presents the test results of the Random Forest model, while the second presents those of the XGBoost model; both include accuracy, recall, precision, and F1 score.
These metrics are useful for understanding a model's strengths and weaknesses and for comparing the performance of different models.
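How these metrics relate to one another is easiest to see from the confusion-matrix counts. The counts below are made up for illustration, not the actual results from the screenshots.

```python
# Hypothetical counts: true/false positives and negatives on a test set
tp, fp, fn, tn = 80, 10, 20, 890

accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction of all correct calls
precision = tp / (tp + fp)   # of predicted frauds, how many were real
recall = tp / (tp + fn)      # of real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
```

With heavy class imbalance, accuracy stays high even for a model that misses most frauds, which is why precision, recall, and F1 carry more weight in this comparison.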



Random Forest Test Results


XGBoost Test Results

Conclusion

In this project, we successfully developed a credit card fraud detection model using machine learning techniques. By implementing oversampling and undersampling techniques, we were able to improve the performance of our models and achieve good results. The Random Forest model was found to be the best performing model in terms of precision and F1 score, while the XGBoost classifier had a better recall score.

Project resources:

Tech used in this project: Python, Scikit-learn, Random Forest, XGBoost
GitHub project link: https://github.com/BoulahiaAhmed/Credit-Card-Fraud-Detection
