Sunday, May 21

Demystifying Semantic Search and Question Answering




Introduction to Semantic Search

Semantic search is a fundamental NLP task that aims to bridge the gap between user queries and the underlying meaning of textual content. 

It goes beyond simple keyword matching, focusing on understanding context and delivering accurate results. 

While LLMs have advanced NLP, semantic search remains crucial, providing a complementary approach to achieve precise and context-aware search results. In this blog, we explore the importance of semantic search and its integration with question answering models, along with creating a user-friendly web interface.

1. Data Preprocessing



   - Understanding the Importance of Data Preprocessing

Data preprocessing plays a crucial role in semantic search as it lays the foundation for accurate and meaningful analysis. 

By extracting text from PDFs and employing techniques like sentence tokenization with Sentence Transformer, we enhance the quality of the data, enabling better understanding and analysis of the content during the semantic search process.

   - Extracting Text from PDFs: A Crucial First Step

Extracting text from PDFs is a crucial first step in the data preprocessing phase. To accomplish this, we employ the powerful pdfplumber Python library. 

By leveraging its functionality, we can efficiently extract the text content from PDF files, ensuring that we have the necessary textual data to perform subsequent semantic search and question answering tasks accurately and effectively.
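As a minimal sketch (the file name is illustrative, not the exact code from this project), the extraction step might look like this:

```python
import pdfplumber

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract and concatenate the text of every page in a PDF."""
    pages_text = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:  # scanned or empty pages may return None
                pages_text.append(page_text)
    return "\n".join(pages_text)

raw_text = extract_text_from_pdf("document.pdf")
print(raw_text[:500])  # preview the first 500 characters
```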

   - Sentence Tokenization with Sentence Transformer: Enhancing Text Analysis

Sentence tokenization with the Sentence Transformers Python package is a pivotal step in enhancing text analysis for semantic search. 

By breaking down the extracted text into individual sentences, we can achieve a finer level of granularity, facilitating improved similarity measurements. 

Leveraging the power of Sentence Transformers, we can identify and compare relevant semantic units within the text, leading to more precise and context-aware search results.
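A possible sketch of this step is shown below. In practice, the sentence splitting itself is often handled by NLTK's sent_tokenize while Sentence Transformers produces the embeddings; both that choice and the all-MiniLM-L6-v2 model name are assumptions rather than the exact setup used in this project:

```python
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

nltk.download("punkt", quiet=True)  # sentence tokenizer data

# Split the raw PDF text into sentences.
sentences = sent_tokenize(raw_text)

# Encode each sentence into a dense vector for similarity search.
model = SentenceTransformer("all-MiniLM-L6-v2")
sentence_embeddings = model.encode(sentences, convert_to_numpy=True)

print(len(sentences), sentence_embeddings.shape)
```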


2. Semantic Search



   - Harnessing the Power of Semantic Search

Unlike traditional keyword-based search, semantic search focuses on understanding the context and meaning behind user queries, leading to more accurate and relevant results. 

By analyzing the semantic relationships, concepts, and entities within the text, semantic search enables a deeper level of comprehension and improves the search experience. 

Whether it's finding specific information within scientific papers, legal documents, or news articles, semantic search empowers us to unlock the full potential of vast knowledge repositories and retrieve "precisely" what we need.

   - Exploring Different Similarity Methods for Semantic Search (Faiss, Annoy, and TF-IDF)

In this tutorial, we explore three different similarity methods for semantic search:

Faiss (Facebook AI Similarity Search): a library for efficient similarity search over dense vectors.

Annoy (Approximate Nearest Neighbors Oh Yeah): a package developed by Spotify for approximate nearest-neighbor search.

TF-IDF (Term Frequency-Inverse Document Frequency): a classic information retrieval technique.

By leveraging these techniques, we aim to obtain the top two most relevant results from each method, thereby improving search accuracy and effectiveness.
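The sketch below reuses the `sentences`, `sentence_embeddings`, and `model` variables from the preprocessing step and retrieves the top two results with each method (the example query and index parameters are illustrative):

```python
import numpy as np
import faiss
from annoy import AnnoyIndex
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "What method does the paper propose?"  # example query
query_vec = model.encode([query], convert_to_numpy=True)

# 1) Faiss: exact L2 search over the sentence embeddings.
dim = sentence_embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(dim)
faiss_index.add(sentence_embeddings.astype("float32"))
_, faiss_ids = faiss_index.search(query_vec.astype("float32"), 2)
faiss_top2 = [sentences[i] for i in faiss_ids[0]]

# 2) Annoy: approximate nearest neighbors with an angular metric.
annoy_index = AnnoyIndex(dim, "angular")
for i, vec in enumerate(sentence_embeddings):
    annoy_index.add_item(i, vec)
annoy_index.build(10)  # number of trees; more trees = better accuracy
annoy_ids = annoy_index.get_nns_by_vector(query_vec[0], 2)
annoy_top2 = [sentences[i] for i in annoy_ids]

# 3) TF-IDF: classic sparse retrieval scored with cosine similarity.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)
scores = cosine_similarity(vectorizer.transform([query]), tfidf_matrix)[0]
tfidf_top2 = [sentences[i] for i in np.argsort(scores)[::-1][:2]]
```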

   - Context Injection: Unveiling the Most Relevant Content

In the context of semantic search, the BERT Question Answering (Q&A) model requires both a context and a question (used as the query) to provide accurate answers. 

To ensure we have the necessary context for the Q&A model, we employ context injection by concatenating all the top results obtained from the similarity methods (from the previous step). 

This process creates a comprehensive and informative context that can be utilized effectively by the question answering model to generate precise and contextualized responses.
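Continuing from the variables above, context injection can be as simple as concatenating the retrieved sentences while dropping duplicates:

```python
# Merge the top results from all three methods, removing duplicates
# while preserving their order of appearance.
candidates = faiss_top2 + annoy_top2 + tfidf_top2
context = " ".join(dict.fromkeys(candidates))
print(context)
```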


3. Question Answering with BERT



   - Unleashing the Potential of Question Answering with BERT

By utilizing the Huggingface transformers package and leveraging a pre-trained model called "bert-large-uncased-whole-word-masking-finetuned-squad" which was fine-tuned on the SQuAD dataset, we can achieve accurate and context-aware question answering. 




This model has the following configuration: 

  • 24 layers
  • 1024 hidden dimensions
  • 16 attention heads
  • 336M parameters

This model is capable of comprehending the nuances of the context and question, allowing it to provide detailed answers based on the information present within the text.
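With the Hugging Face transformers package, loading this model and answering a question against the injected context takes only a few lines (a sketch; `query` and `context` come from the previous steps):

```python
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

result = qa_pipeline(question=query, context=context)
print(result["answer"], result["score"])
```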

While the release of advanced language models like ChatGPT and other LLMs has garnered significant attention, the impact on the popularity of BERT-based question answering models can be observed from the download graph. 

Over time, the number of downloads for BERT models has decreased significantly, reflecting the shift in focus towards broader language models. 

However, it's important to note that BERT still remains a valuable and effective choice for accurate and contextualized question answering tasks.


4. Putting It All Together - Creating a Web Interface with Gradio

   - Enhancing User Experience



Creating a user-friendly web interface is essential for enhancing the overall user experience of our semantic search and question answering system. 

Gradio, a Python package, proves to be a key tool in this regard. With its intuitive and straightforward design, Gradio allows us to easily build interactive interfaces without extensive web development knowledge. 

By providing a simple and elegant way to showcase our semantic search and question answering functionalities, Gradio empowers users to interact seamlessly with our system, making it accessible and user-friendly for both technical and non-technical users alike.
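A minimal Gradio wrapper could look like the sketch below. `answer_question` is a hypothetical helper that chains the steps described earlier, and `build_context` stands in for the retrieval and context-injection logic:

```python
import gradio as gr

def answer_question(pdf_file, question):
    """Hypothetical end-to-end helper: extract text, retrieve context, answer."""
    # Gradio passes an uploaded temp file; .name gives its path (may vary by version).
    text = extract_text_from_pdf(pdf_file.name)
    context = build_context(text, question)  # assumed helper combining the steps above
    return qa_pipeline(question=question, context=context)["answer"]

demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.File(label="PDF document"), gr.Textbox(label="Your question")],
    outputs=gr.Textbox(label="Answer"),
    title="Semantic Search & Question Answering",
)

demo.launch()
```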


5. Conclusion

In this blog, we explored the fascinating world of semantic search and question answering. We delved into the importance of data preprocessing, highlighting the crucial steps of extracting text from PDFs and performing sentence tokenization using Sentence Transformers. We then ventured into the realm of semantic search, discovering the power of different similarity methods like Faiss, Annoy, and TF-IDF to unveil the most relevant content. Leveraging the BERT Question Answering model, we witnessed how it provides accurate and contextualized answers, fueled by its understanding of language and fine-tuning on the SQuAD dataset.

Finally, we witnessed the seamless integration of all these components through the creation of a web interface using Gradio. This interface enhanced the user experience, allowing users to effortlessly interact with our semantic search and question-answering system.



Tuesday, March 14

Image Generator Web Application


Are you looking for an easy and fun way to generate images from text prompts? Look no further than a web application built with Google Colab, Stable Diffusion, and Gradio! In this blog post, we'll explore how to create an image generator web application using these tools. With just a few clicks, you can create a web app that generates stunning images based on text prompts. And the best part? You don't need any coding experience to get started.




What is Stable Diffusion? 

Stable Diffusion is a powerful deep learning model that can generate high-quality images from text prompts. It is based on the Diffusion Probabilistic Models (DPMs) framework, which is a class of generative models that can capture complex dependencies between variables in a probabilistic manner. Stable Diffusion can generate images that are highly detailed and diverse, making it an excellent tool for artists, designers, and researchers alike.


What is Gradio?

Gradio is a Python library that allows you to quickly create custom user interfaces for your machine learning models. With Gradio, you can create an interactive web interface that lets users experiment with different prompts and see the results in real-time. 
Gradio also supports a wide range of input and output types, making it easy to integrate with a variety of machine learning models and applications. 


Your App in 3 quick steps!

To create an image generator web application, you'll need to follow these three steps (a minimal code sketch follows the list):

1. Create a Google Colab notebook and install the Stable Diffusion package. 
2. Use Stable Diffusion to generate images from text prompts. 
3. Use Gradio to create an interactive and sharable web interface for your image generator.
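Here is a minimal sketch of what such a notebook might contain, assuming the Hugging Face diffusers package and the stabilityai/stable-diffusion-2 checkpoint on a GPU runtime (the original app's exact setup may differ):

```python
# In Colab: !pip install diffusers transformers accelerate gradio
import torch
import gradio as gr
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2",
    torch_dtype=torch.float16,
).to("cuda")

def generate(prompt):
    """Generate one image from a text prompt."""
    return pipe(prompt).images[0]

demo = gr.Interface(
    fn=generate,
    inputs=gr.Textbox(label="Text prompt"),
    outputs=gr.Image(label="Generated image"),
    title="Image Generator",
)

# share=True produces a public link you can send to others from Colab.
demo.launch(share=True)
```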

Conclusion

In conclusion, the image generator web application is an easy and fun way to generate stunning images from text prompts. With just a few clicks and no coding experience necessary, you can create a user-friendly interface that allows users to experiment with different prompts and see the generated images in real time. If you enjoyed this post, don't forget to subscribe to our blog for more exciting and informative content. And if you have any questions or feedback, please don't hesitate to reach out to us via our contact links. We look forward to hearing from you!


Project Resources 


Tech used in this project: Python, Gradio, Stable Diffusion 2.0 



Thursday, March 9

ML-Olympiad Water-Quality-Prediction

 



Introduction

Greetings everyone, I am excited to share my journey participating in the water quality estimation competition. 
The competition required us to build a machine learning model based on the training data provided and predict the water quality estimation for the test dataset accurately. I put my knowledge of machine learning and data analysis into practice to preprocess, analyze, and visualize the data. I explored various regression techniques and hyperparameters to find the best model for this task. After numerous iterations, I was able to build a model that achieved high accuracy in predicting the water quality estimation for the test dataset. 
My hard work and dedication paid off as I secured the 18th position in the competition. I am sharing the code I used for this prediction task (regression) below, hoping that it can help and inspire others to pursue their interests in machine learning.


Machine Learning Models

I utilized three different machine learning models to predict the quality estimation for the test dataset. These models were the Sequential Neural Network, the XGBoost Regressor, and the Random Forest Regressor. 
Through rigorous experimentation and testing, I found that the XGBoost Regressor and the Random Forest Regressor performed the best in terms of prediction accuracy. 



Both models outperformed the Sequential Neural Network in this task, which is a reasonable outcome given the nature of the data. 

The XGBoost Regressor and the Random Forest Regressor are both tree-based models that excel in handling tabular data with multiple levels of categorical data. These models can capture complex interactions between variables, making them particularly well-suited for this type of problem. Ultimately, the XGBoost Regressor had the best performance based on the RMSE metric, followed closely by the Random Forest Regressor. 
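As a rough sketch of this comparison, assuming the preprocessed features and target are stored in `X` and `y` (the hyperparameters are illustrative, not the tuned values from the competition):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "XGBoost": XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    print(f"{name}: validation RMSE = {rmse:.4f}")
```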

Conclusion

I believe that the combination of these two models can provide a robust solution for similar regression problems in the future.
I would like to extend an invitation to try your own model and submit a late entry for the water quality estimation competition. This is an excellent opportunity to put your skills to the test and see how well your model performs against others in the competition. The competition data and rules are still available, so don't hesitate to give it a shot. You might be surprised at how well your model performs. Plus, this competition is an excellent opportunity to learn new techniques, explore new algorithms, and build your portfolio. 
So, why not take a shot and see how your model stacks up against others? Good luck, and happy modeling!


Project Resources

Tech used in this project: Python, Keras, Sklearn, Random Forest, XGBoost
GitHub project link: https://github.com/BoulahiaAhmed/ML-Olympiad--Water-Quality-Prediction







Wednesday, January 18

Data Science Project - British Airways


Introduction:

This work is part of a virtual internship with British Airways, in which we analyze and interpret British Airways data. 
The goal of this project is to gain insights and identify patterns in the data that can potentially improve the overall performance and customer satisfaction of British Airways.

This project is composed of two main tasks:
The first task is focused on scraping data from a third-party website called SKYTRAX, which provides customer reviews and ratings of various airlines. We will use this data to gain insights into customer perceptions and satisfaction with British Airways. 

The second task is centered around building a predictive model using data provided by British Airways. The goal is to use this data to predict future customers based on multiple features. 

By completing these two tasks, we aim to gain a comprehensive understanding of British Airways' customers and identify areas for improvement.


Task 1: Data Scraping, Preprocessing and Analyzing


For the first task, we utilized the Beautiful Soup python library to scrape more than 3,000 reviews from the SKYTRAX website. 
The data collected included information on customer ratings, reviews, and demographics. 
After collecting the data, we performed extensive data cleaning and preprocessing to ensure the data was in a usable format for analysis. We then extracted valuable findings, including customer sentiment towards specific aspects of British Airways' service, such as cabin comfort, entertainment, Wi-Fi, etc.

These findings were represented in a slide deck, which was used to provide an overview of the results and identify key areas for improvement. 
The data collected from this task provided a comprehensive understanding of customer perceptions and satisfaction with British Airways and helped us to identify areas where the company could improve its service.
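A simplified version of this scraping loop is sketched below; the URL pattern and the CSS class name are assumptions about the SKYTRAX page structure rather than the exact selectors used in the project:

```python
import requests
from bs4 import BeautifulSoup

base_url = "https://www.airlinequality.com/airline-reviews/british-airways"  # SKYTRAX reviews
reviews = []

for page in range(1, 11):  # first 10 pages as an example
    response = requests.get(f"{base_url}/page/{page}/", timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")
    # The class name below is an assumption about the review markup.
    for item in soup.find_all("div", class_="text_content"):
        reviews.append(item.get_text(strip=True))

print(f"Collected {len(reviews)} reviews")
```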





Task 2: Feature Engineering, Building ML Models and Evaluation


The second task involved working with a dataset of 50,000 rows provided by British Airways. 
The main challenges in this dataset were high-cardinality categorical features and an unbalanced class distribution. To handle the categorical data, we first applied standard feature encoding to the low-cardinality categorical features and target encoding to the high-cardinality ones. 
This approach converted the categorical data into numerical values, which can be easily handled by machine learning models.
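A lightweight way to target-encode a high-cardinality column with pandas is sketched below (the column names are hypothetical; dedicated libraries such as category_encoders offer more robust implementations with smoothing and cross-validation):

```python
import pandas as pd

def target_encode(train: pd.DataFrame, test: pd.DataFrame,
                  column: str, target: str) -> None:
    """Replace a categorical column with the mean of the target per category,
    computed on the training set only to avoid leakage."""
    means = train.groupby(column)[target].mean()
    global_mean = train[target].mean()
    train[column + "_te"] = train[column].map(means)
    test[column + "_te"] = test[column].map(means).fillna(global_mean)

# Hypothetical usage on a high-cardinality feature such as the booking route:
# target_encode(train_df, test_df, column="route", target="booking_complete")
```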

To address the unbalanced dataset, we used two techniques: SMOTE (Synthetic Minority Over-sampling Technique) and random under-sampling. Both techniques were applied to the training set only, after splitting the data into train and test sets. 

SMOTE generates synthetic data points for the minority class, in order to balance the class distribution. The synthetic samples are generated by interpolating between existing minority samples. 

Random under-sampling, on the other hand, randomly removes some of the majority-class samples so that the class distribution becomes more balanced. Used together, SMOTE increases the number of minority-class samples while random under-sampling decreases the number of majority-class samples, giving the model a more balanced distribution to learn from. 
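With the imbalanced-learn package, applying both techniques to the training split only might look like the sketch below, where `X` and `y` are the already-encoded features and target (the sampling ratios are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Split first, so the test set keeps the original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the minority class to half the majority size, then trim the
# majority class down to match (these ratios are illustrative only).
smote = SMOTE(sampling_strategy=0.5, random_state=42)
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)

X_res, y_res = smote.fit_resample(X_train, y_train)
X_res, y_res = rus.fit_resample(X_res, y_res)

print(pd.Series(y_res).value_counts())
```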

After balancing the dataset, we trained three different machine learning models: a Random Forest classifier, an XGBoost classifier, and a CatBoost classifier. 
The performance of these models was then evaluated using the F1 score and the AUC (Area Under the Curve) score as metrics. 

The F1 score is a measure of a model's accuracy that takes into account both precision and recall. It is commonly used in imbalanced classification problems. 

The AUC score measures a model's ability to distinguish between the positive and negative classes. It ranges from 0 to 1, with 1 indicating a perfect classifier and 0.5 indicating performance no better than random guessing.

After evaluating the models, it was found that the XGBoost classifier had the best performance. 
The XGBoost algorithm is an optimized version of the gradient boosting algorithm and is known for its high performance and ability to handle large datasets with multiple features.




Project Resources:

Tech used in this project: Python, Beautiful Soup, Scikit-learn, Random Forest, XGBoost, CatBoost














Saturday, January 14

Audio Transcription Web Application using Flask

In today's world, audio files are widely used in various fields such as podcasting, voice notes, and more. However, manually transcribing these audio files can be a tedious and time-consuming task. To simplify this process, we have developed a basic audio transcription web application using the Flask framework.


How it works

The application allows users to upload an audio file, and it will return the transcribed text of the audio file. The application uses the SpeechRecognition library to perform the transcription. The SpeechRecognition library is a Python library that helps in working with speech recognition. It supports several engines and APIs, including Google Speech Engine, Google Cloud Speech API, and more.

The application also uses the Flask framework for the web interface. Flask is a lightweight Python web framework that enables us to develop web applications easily. It provides a simple and easy-to-use API for handling requests and responses.
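At its core, the transcription route can be sketched as follows; the template name and form field are assumptions, and recognize_google expects a WAV/AIFF/FLAC file plus an internet connection:

```python
import speech_recognition as sr
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def transcribe():
    transcript = ""
    if request.method == "POST":
        audio_file = request.files.get("file")  # assumed form field name
        if audio_file:
            recognizer = sr.Recognizer()
            with sr.AudioFile(audio_file) as source:
                audio_data = recognizer.record(source)  # read the whole file
            try:
                transcript = recognizer.recognize_google(audio_data)
            except sr.UnknownValueError:
                transcript = "Could not understand the audio."
    return render_template("index.html", transcript=transcript)

if __name__ == "__main__":
    app.run(debug=True)
```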


User Interface

The application has a simple user interface that allows users to upload an audio file and view the transcribed text. The user selects an audio file by clicking the "Choose File" button and then clicks the "Transcribe" button; the transcribed text is then displayed on the page.


Conclusion

In conclusion, this audio transcription web application using Flask can be a valuable tool for anyone who needs to transcribe audio files quickly and easily. It is a basic application, but it can be further enhanced with additional features and functionalities. The use of the Flask framework and SpeechRecognition library make it easy to develop and maintain. This application can be a time-saver for podcasters, journalists, students, and anyone who needs to transcribe audio files regularly.


Project resources

Tech used in this project: Python, Flask, CSS, Html
GitHub project link: https://github.com/BoulahiaAhmed/Audio-Transcription-Webapp-using-Flask




Sunday, January 8

Identifying Credit Card Fraud through Machine Learning Techniques



Introduction

Credit card fraud is a pervasive problem that affects both consumers and financial institutions. With the increase in online transactions, the risk of fraud has also increased. In this project, we aimed to develop a model to detect credit card fraud using machine learning techniques.


Dataset Description

The dataset used in this project consisted of 284,807 transactions, of which 492 were identified as fraudulent.
This represents a fraud rate of 0.172%. The data was highly unbalanced, with a large majority of transactions being non-fraudulent.



The value 1 denotes fraudulent transactions, while the value 0 denotes non-fraudulent transactions.


Methods and Algorithms

To overcome the issue of unbalanced data, we implemented two techniques: oversampling using Synthetic Minority Oversampling Technique (SMOTE) and Under-sampling using Random Under-Sampling Technique (RUS).
These techniques helped to balance the data and improve the performance of our machine learning models.



SMOTE creates synthetic examples specifically for the minority class. The algorithm picks minority-class examples that are close to one another in the feature space, draws a line connecting them, and then creates a new sample at a position along that line.


Random Under-Sampling Technique



RUS involves randomly selecting examples from the majority class and deleting them from the training dataset. In the random under-sampling, the majority class instances are discarded at random until a more balanced distribution is reached.

Using these two techniques, we balanced the training data by transforming the original training split, which had 200,405 genuine transactions and 354 fraudulent transactions, into a balanced dataset with 198,276 examples of each class.

This is an important step in training a model for a classification task, as it helps to prevent the model from having a bias towards one class.

Machine Learning Models

In our study, we implemented two machine learning models: a Random Forest classifier and an XGBoost classifier. Our analysis revealed that the Random Forest model achieved superior precision and F1 score compared to the XGBoost classifier. While the XGBoost classifier demonstrated a higher recall score, the Random Forest model was more precise in predicting the true labels of the data, as reflected in its higher precision and F1 score.

Evaluating the results

The figure on top presents the test results of the Random Forest model, while the one on the bottom presents the results of the XGBoost model; both include the accuracy, recall, precision, and F1 score.
These metrics can be useful for understanding the strengths and weaknesses of a model and for comparing the performance of different models.



Random Forest Test Results


XGBoost Test Results
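The comparison summarised in these figures can be reproduced with scikit-learn's metric functions. A minimal sketch, assuming `rf_model` and `xgb_model` are the trained classifiers and `X_test`, `y_test` the held-out split (the variable names are assumptions):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

for name, clf in [("Random Forest", rf_model), ("XGBoost", xgb_model)]:
    y_pred = clf.predict(X_test)
    print(
        f"{name}: "
        f"accuracy={accuracy_score(y_test, y_pred):.4f}, "
        f"precision={precision_score(y_test, y_pred):.4f}, "
        f"recall={recall_score(y_test, y_pred):.4f}, "
        f"F1={f1_score(y_test, y_pred):.4f}"
    )
```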

Conclusion

In this project, we successfully developed a credit card fraud detection model using machine learning techniques. By implementing oversampling and undersampling techniques, we were able to improve the performance of our models and achieve good results. The Random Forest model was found to be the best performing model in terms of precision and F1 score, while the XGBoost classifier had a better recall score.

Project resources:

Tech used in this project: Python, Sklearn, Random Forest, XGBoost
GitHub project link: https://github.com/BoulahiaAhmed/Credit-Card-Fraud-Detection


Saturday, December 31

Arabic Quote Generator Using GPT-2


The goal of this project was to utilize a dataset of Arabic quotes in order to fine-tune a GPT-2 model for the generation of Arabic quotations. The process involved importing and pre-processing the dataset, preparing it for use as input for the GPT-2 model, fine-tuning the model, and evaluating the generated quotations.


Part 1: Importing Data & Pre-processing the Arabic Quote Dataset

The first step in this project was to import the Arabic quote dataset and perform any necessary pre-processing. This included cleaning the data and removing any invalid or irrelevant entries.


Part 2: Preparing the Dataset

Once the data had been imported and pre-processed, it needed to be prepared for use as input for the GPT-2 model. This involved converting the data from a dataframe into a text file and ensuring that it was in the proper format for the model to consume.


Part 3: Fine-tuning the GPT-2 Model

With the dataset prepared, the next step was to fine-tune the GPT-2 model using the Arabic quote dataset. This involved training the model on the dataset and adjusting its hyperparameters to optimize performance.
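A compact sketch of this fine-tuning step with the Hugging Face transformers library is shown below; the base checkpoint, file name, and hyperparameters are assumptions (in practice, an Arabic-capable GPT-2 checkpoint is a better starting point than the English gpt2 base):

```python
from transformers import (
    GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

base_model = "gpt2"  # assumed base checkpoint
tokenizer = GPT2TokenizerFast.from_pretrained(base_model)
model = GPT2LMHeadModel.from_pretrained(base_model)

# The quotes are assumed to have been written to a plain text file in Part 2.
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="arabic_quotes.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="gpt2-arabic-quotes",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=500,
)

Trainer(model=model, args=training_args, data_collator=data_collator,
        train_dataset=train_dataset).train()
```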


Part 4: Generating Arabic Quotes Based on User Inputs

With the GPT-2 model fine-tuned, the next step was to use it to generate Arabic quotations based on user inputs. This involved providing the model with a prompt and allowing it to generate a quotation in response.
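Generation from a user prompt can then be sketched with the text-generation pipeline (the prompt and sampling parameters are illustrative):

```python
from transformers import pipeline

# Load the fine-tuned model saved in the previous step.
generator = pipeline("text-generation", model="gpt2-arabic-quotes")

prompt = "النجاح"  # hypothetical user input ("success")
outputs = generator(prompt, max_length=60, num_return_sequences=3,
                    do_sample=True, top_p=0.95)
for out in outputs:
    print(out["generated_text"])
```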


Part 5: Evaluation of Results

The final step in this project was to evaluate the quality of the generated quotations. This was done by comparing them to the original dataset and assessing their relevance, coherence, and overall quality.


Conclusion

Overall, this project was successful in achieving its goal of using a dataset of Arabic quotes to fine-tune a GPT-2 model for the generation of Arabic quotations. The resulting model was able to generate quotations that were relevant, coherent, and of high quality, demonstrating the effectiveness of the fine-tuning process. 


Project resources and overview:

Tech Used in this project: Python, GPT-2.





Friday, December 30

YTbrief: AI solution that will make your YouTube experience easier

YouTube Is One of the Biggest Search Engines, But…

On YouTube, there is a proliferation of clickbait or misleading content, which can be frustrating for users trying to find reliable information. Many videos on YouTube cover multiple topics at the same time, which can make it difficult for users to locate the specific information they are seeking. Longer videos can be challenging to follow and may lack the engagement necessary to keep the viewer's attention until the end.


Solution: YTbrief

With YTbrief, an AI-powered application, users can easily navigate to the specific parts of a video they want to watch or listen to in just a few simple steps.
By pasting any YouTube link and entering a search query, users can quickly access the desired content within the video: the result is the same YouTube video, starting at the section most relevant to the query.

This innovative tool streamlines the video viewing experience, allowing users to efficiently find and enjoy the content they are looking for.

Using YTbrief takes only three steps (a minimal sketch of a possible implementation follows the list):

1. Paste any YouTube link.

2. Enter your search query and click Search.

3. View the results: a YouTube video that starts at the section most relevant to your query.
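Under the hood, one plausible way to combine the tools listed in the project resources (Cohere embeddings and Faiss) is to embed transcript segments and search them with the user's query. The sketch below is an assumption about the architecture rather than the actual YTbrief code; the transcript segments and the API key are placeholders:

```python
import cohere
import faiss
import numpy as np

co = cohere.Client("YOUR_API_KEY")  # hypothetical API key

# segments: list of (text, start_time_in_seconds) pairs obtained from the
# video transcript elsewhere (e.g., with a transcript-fetching package).
segments = [("Intro and agenda", 0.0),
            ("Main topic explained", 95.0),
            ("Q&A session", 610.0)]

# Embed the transcript segments and the user query with Cohere.
texts = [text for text, _ in segments]
seg_embeddings = np.array(co.embed(texts=texts).embeddings, dtype="float32")
query_embedding = np.array(
    co.embed(texts=["Where is the main topic explained?"]).embeddings,
    dtype="float32",
)

# Index the segment embeddings with Faiss and retrieve the closest segment.
index = faiss.IndexFlatL2(seg_embeddings.shape[1])
index.add(seg_embeddings)
_, ids = index.search(query_embedding, 1)
best_start = int(segments[ids[0][0]][1])

# Jump straight to the relevant section of the video.
print(f"https://www.youtube.com/watch?v=VIDEO_ID&t={best_start}s")
```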


It works better with:

The results of this application are particularly effective when applied to podcasts, lectures, educational videos, seminars, and documentaries. The ability to easily skip to specific sections within these types of audio and video content greatly enhances the user's ability to absorb and retain information, making it an invaluable resource for learners and professionals alike.


Project resources and overview:

Tech Used in this project: Python, Streamlit, Cohere.ai, and Faiss.





