News Text Classification
This project builds a multi-class text classification pipeline to categorize AG News articles into four categories: World, Sports, Business, and Science/Technology. Using the AG News dataset from Kaggle, the system preprocesses raw text, extracts features via TF-IDF vectorization, and trains a Logistic Regression model for classification. The workflow includes comprehensive evaluation with accuracy, classification reports, confusion matrices, and visualizations of the most frequent words per category.
Year: 2025
Service: ML Model
Category: NLP
Tool: scikit-learn
Description:
This project classifies news articles into one of four categories (World, Sports, Business, and Science/Technology) using Natural Language Processing (NLP) and machine learning. The workflow begins by preprocessing the article text: lowercasing, removing special characters, removing stopwords, and lemmatizing with NLTK.
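The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_text`, the tiny `DEMO_STOPWORDS` set, and the pluggable `lemmatize` argument are all assumptions made so the example runs without downloading NLTK corpora. In the project itself, NLTK's `stopwords.words("english")` and `WordNetLemmatizer().lemmatize` would presumably be passed in instead.

```python
import re

# Tiny illustrative stopword set (assumption); the project uses NLTK's English list.
DEMO_STOPWORDS = {"the", "a", "an", "in", "on", "of", "and", "to", "is", "as", "for"}


def clean_text(text, stopwords=DEMO_STOPWORDS, lemmatize=lambda token: token):
    """Lowercase, strip non-letter characters, drop stopwords, lemmatize tokens.

    `lemmatize` defaults to the identity function; a real lemmatizer such as
    nltk.stem.WordNetLemmatizer().lemmatize can be supplied by the caller.
    """
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # digits and punctuation become spaces
    tokens = [lemmatize(tok) for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


cleaned = clean_text("Oil prices fall 3% as the markets rally!")
print(cleaned)
```

Keeping the lemmatizer as an argument makes the cleaning step easy to test in isolation and to swap for a different normalizer later.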
The cleaned text is transformed into numerical features using TF-IDF vectorization, which weights each term by how often it appears in an article, discounted by how common the term is across the whole corpus. A Logistic Regression model is then trained to perform multi-class classification. Model performance is evaluated using accuracy scores, classification reports, and confusion matrices. Additionally, the most frequent words per category are visualized using Seaborn bar plots to show which terms dominate each class.
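A minimal sketch of the TF-IDF plus Logistic Regression step, including the three evaluation outputs named above, might look like this. The handful of sentences below is a made-up stand-in for the AG News data (an assumption for illustration only), and the default vectorizer settings and `max_iter=1000` are likewise assumptions rather than the project's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import make_pipeline

LABELS = ["World", "Sports", "Business", "Science/Technology"]

# Toy stand-in corpus (assumption); the project trains on the AG News dataset.
train_texts = [
    "president meets foreign leaders at the summit",
    "government signs peace treaty after talks",
    "team wins the championship game in overtime",
    "striker scores twice in the cup final",
    "company shares rise after strong quarterly earnings",
    "central bank raises interest rates again",
    "researchers launch new satellite into orbit",
    "software update fixes a security flaw in phones",
]
train_labels = [
    "World", "World", "Sports", "Sports",
    "Business", "Business", "Science/Technology", "Science/Technology",
]

test_texts = [
    "leaders sign treaty at the summit",
    "team scores in the final game",
    "bank earnings lift company shares",
    "new software for the satellite",
]
test_labels = ["World", "Sports", "Business", "Science/Technology"]

# Chain the vectorizer and classifier so both are fitted on training data only.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)
predictions = model.predict(test_texts)

accuracy = accuracy_score(test_labels, predictions)
report = classification_report(test_labels, predictions, labels=LABELS, zero_division=0)
matrix = confusion_matrix(test_labels, predictions, labels=LABELS)

print(f"Accuracy: {accuracy:.2f}")
print(report)
print(matrix)  # rows are true labels, columns are predicted labels, in LABELS order
```

Wrapping both stages in one pipeline keeps the test articles out of the vocabulary and IDF statistics, so the reported scores are not inflated by leakage.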
The project demonstrates the complete lifecycle of an NLP application: from dataset preparation and text preprocessing to feature engineering, supervised model training, evaluation, and result visualization.
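The per-category word-frequency visualization mentioned in the description could be produced along these lines. Both helper names, `top_words_per_category` and `plot_top_words`, are hypothetical, and the sketch assumes the text has already been cleaned so that splitting on whitespace yields the tokens to count.

```python
from collections import Counter


def top_words_per_category(texts, labels, n=10):
    """Return {category: [(word, count), ...]} with the n most frequent words."""
    counters = {}
    for text, label in zip(texts, labels):
        counters.setdefault(label, Counter()).update(text.split())
    return {label: counter.most_common(n) for label, counter in counters.items()}


def plot_top_words(top_words, category):
    """Draw a horizontal Seaborn bar plot of one category's most frequent words."""
    # Imported here so the counting helper works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    words, counts = zip(*top_words[category])
    ax = sns.barplot(x=list(counts), y=list(words))
    ax.set(title=f"Most frequent words: {category}", xlabel="Count", ylabel="Word")
    plt.show()


# Made-up cleaned snippets (assumption), standing in for preprocessed articles.
demo_texts = ["goal match goal", "match win goal", "market share market"]
demo_labels = ["Sports", "Sports", "Business"]
top_words = top_words_per_category(demo_texts, demo_labels, n=2)
print(top_words)
```

Calling `plot_top_words(top_words, "Sports")` would then render the chart for a single category.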
Key Highlights:
Problem: Automatically categorize news articles into predefined topics.
Approach: TF-IDF + Logistic Regression.
Dataset: AG News Dataset from Kaggle.
Deployment: Local execution in Python with visualizations for insights.
Tools:
Python
Pandas
scikit-learn
NLTK
TfidfVectorizer
Matplotlib
Seaborn
tqdm


