IUT - News classification algorithms

Implementation and optimization of text classification algorithms

Context

Project completed during the first semester of the BUT in Computer Science, in collaboration with Manu Thuillier.

Objective

Implement and optimize automatic text classification algorithms that sort news articles into five categories: politics, culture, environment/technology, economics, and sports.

Implementation

Weight-based classification

The first approach gives each word a weight per category and scores an article by category from the words it contains:

  • Word normalization (case, plurals, vowels)
  • Binary search optimization
  • Common word filtering
  • ~65% accuracy
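
The steps above can be sketched in Python as follows. This is a minimal illustration and not the project's actual code: the stop-word list, the lexicon, its weights, and the function names (`normalize`, `lookup`, `classify`) are all hypothetical. For brevity, each word here maps to a single category, and normalization only handles case, punctuation, and a trailing plural "s".

```python
from bisect import bisect_left

# Hypothetical stop-word list; the project's real list is not shown.
STOP_WORDS = {"the", "a", "of", "and", "in"}

# Hypothetical lexicon of (word, category, weight), one entry per word,
# kept sorted by word so that lookups can use binary search.
LEXICON = sorted([
    ("election", "politics", 3),
    ("match", "sports", 3),
    ("market", "economics", 2),
    ("minister", "politics", 2),
    ("team", "sports", 2),
])
LEXICON_WORDS = [entry[0] for entry in LEXICON]


def normalize(word):
    """Lowercase, strip punctuation, and drop a trailing plural 's' (crude)."""
    word = word.lower().strip(".,;:!?\"'()")
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    return word


def lookup(word):
    """Binary search in the sorted lexicon; returns (category, weight) or None."""
    i = bisect_left(LEXICON_WORDS, word)
    if i < len(LEXICON_WORDS) and LEXICON_WORDS[i] == word:
        return LEXICON[i][1], LEXICON[i][2]
    return None


def classify(text):
    """Sum the weights per category and return the highest-scoring category."""
    scores = {}
    for raw in text.split():
        word = normalize(raw)
        if not word or word in STOP_WORDS:
            continue  # common words carry no category signal
        hit = lookup(word)
        if hit is not None:
            category, weight = hit
            scores[category] = scores.get(category, 0) + weight
    return max(scores, key=scores.get) if scores else None
```

Calling `classify("The minister won the elections")` sums the weights of "minister" and "election" under the same category; an article with no known words yields `None`. Binary search keeps each lookup logarithmic in the lexicon size, which is presumably the motivation for the optimization listed above.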

K-nearest neighbors

The second approach uses vector representations:

  • Articles represented as weight vectors
  • Distance calculations between vectors
  • Category determined by majority vote among the k nearest neighbors
  • ~75% accuracy with k=5
  • Complexity optimized to O(n log n)
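
A minimal Python sketch of the voting scheme described above, under several assumptions: Euclidean distance is used (the source does not name the metric), the toy vectors and the names `distance`, `knn_classify`, and `TRAINING` are hypothetical, and `heapq.nsmallest` is just one convenient way to select the k nearest neighbors, not necessarily the optimization the project used.

```python
import heapq
import math
from collections import Counter


def distance(u, v):
    """Euclidean distance between two equal-length weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_classify(query, training, k=5):
    """Return the majority category among the k training vectors nearest to query.

    training is a list of (vector, category) pairs.
    """
    # Select the k closest labeled vectors without sorting the whole list.
    nearest = heapq.nsmallest(
        k, training, key=lambda item: distance(query, item[0])
    )
    # Each neighbor casts one vote for its own category.
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]


# Toy data: each vector holds hypothetical weights for two vocabularies.
TRAINING = [
    ([5.0, 0.0], "politics"),
    ([4.0, 1.0], "politics"),
    ([3.5, 0.5], "politics"),
    ([0.0, 5.0], "sports"),
    ([1.0, 4.0], "sports"),
    ([0.5, 3.5], "sports"),
]
```

With this toy data, `knn_classify([4.0, 0.5], TRAINING, k=5)` collects three "politics" neighbors and two "sports" neighbors, so the vote goes to "politics". An odd k such as 5 avoids ties when only two categories are in play, though ties remain possible with five categories.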

Results

The performance analysis, detailed in the attached report, showed KNN outperforming the weight-based approach (about 75% versus 65% accuracy), while the optimizations kept its computational complexity reasonable.