AI-Generated Text Detection

A system for distinguishing AI-generated text from human-written text, combining feature engineering with transformer-based NLP techniques

keywords: Deep Learning, NLP, AI Text Detection, Classification, Feature Engineering, Machine Learning, DeBERTa, BERT


Brief Description

In recent years, large language models (LLMs) have become increasingly sophisticated and can now generate text that is difficult to distinguish from human writing. Students can use LLMs to produce essays that are not their own work and so miss crucial learning milestones, a development that is also bringing significant changes to the education system.

In this project, I developed a deep-learning-based model to detect whether an essay was written by a student or by an LLM, which may help evaluators take appropriate action.

Dataset

The task requires two types of data:

  • Human-written text:
    • PERSUADE Corpus 2.0
    • This dataset comprises over 25,000 argumentative essays written by 6th-12th grade students in the United States on 15 topics.
  • AI-generated text:
    • For the AI-generated data, we used several available LLMs (ChatGPT with GPT-3.5, Llama 2, Mistral, Gemini) to write essays on the same topics as the human-written text.
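The two sources can be merged into a single labeled set for binary classification. The following is a minimal sketch, not the project's actual pipeline: the function name `build_dataset`, the label convention (0 = human, 1 = AI), the 20% validation fraction, and the seed are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (essay text, label); assumed convention: 0 = human, 1 = AI


def build_dataset(
    human_essays: List[str],
    ai_essays: List[str],
    val_fraction: float = 0.2,  # assumed split size, not from the project
    seed: int = 42,
) -> Tuple[List[Example], List[Example]]:
    """Label both sources and make a stratified train/validation split."""
    rng = random.Random(seed)
    groups = {
        0: [(text, 0) for text in human_essays],
        1: [(text, 1) for text in ai_essays],
    }
    train: List[Example] = []
    val: List[Example] = []
    # Split each class separately so both splits keep the same class ratio.
    for label in (0, 1):
        group = groups[label]
        rng.shuffle(group)
        n_val = int(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val
```

Splitting per class keeps the human/AI ratio the same in both splits, which matters when one source is much larger than the other.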

Modeling Approach

  • The task is binary classification.
  • We used two modeling approaches:
    1. Feature-based ML model
    2. Deep-learning-based model

ML Modeling:

For the conventional ML model, we extracted features from the dataset at three levels:

  1. Paragraph-level features
  2. Sentence-level features
  3. Word-level features
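The three feature levels listed above could be computed along the following lines. This is a minimal sketch using only the standard library; the specific features (counts, mean lengths, type-token ratio) and the name `extract_features` are illustrative assumptions, since the exact feature set used in the project is not listed here.

```python
import re
from statistics import mean
from typing import Dict

WORD_RE = re.compile(r"[A-Za-z']+")


def extract_features(essay: str) -> Dict[str, float]:
    """Compute a few paragraph-, sentence-, and word-level features."""
    # Paragraph level: paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", essay) if p.strip()]
    # Sentence level: naive split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", essay.strip()) if s]
    sentence_lengths = [len(WORD_RE.findall(s)) for s in sentences]
    # Word level: lowercase alphabetic tokens.
    words = WORD_RE.findall(essay.lower())

    return {
        "n_paragraphs": len(paragraphs),
        "n_sentences": len(sentences),
        "mean_sentence_len": mean(sentence_lengths) if sentence_lengths else 0.0,
        "n_words": len(words),
        "mean_word_len": mean(len(w) for w in words) if words else 0.0,
        # Vocabulary diversity: unique words divided by total words.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }
```

The resulting dictionary can be turned into a feature vector and passed to any conventional classifier. The regex-based sentence split is deliberately simple and will mis-split abbreviations such as "e.g."; a proper sentence tokenizer would be more robust.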

DL Modeling:

For the deep learning approach, we used several transformer-based models:

  1. BERT-base-cased
  2. BERT-small
  3. DeBERTa-v3-small
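A transformer classifier of this kind can be set up with the Hugging Face `transformers` library roughly as follows. This is a hedged sketch, not the project's training code: the Hub checkpoint names, the label convention (0 = human, 1 = AI), the 512-token limit, and the helper names `score_essays` and `probs_to_labels` are assumptions. BERT-small is omitted because the exact checkpoint used is not stated. The heavy imports are deferred so the module loads without `torch` installed.

```python
from typing import Dict, List

# Assumed Hugging Face Hub identifiers for two of the models listed above.
CHECKPOINTS: Dict[str, str] = {
    "bert-base-cased": "bert-base-cased",
    "deberta-v3-small": "microsoft/deberta-v3-small",
}
ID2LABEL = {0: "human", 1: "ai"}  # assumed label convention


def probs_to_labels(ai_probs: List[float], threshold: float = 0.5) -> List[str]:
    """Map P(AI-generated) scores to class names using a decision threshold."""
    return [ID2LABEL[int(p >= threshold)] for p in ai_probs]


def score_essays(
    texts: List[str],
    checkpoint: str = CHECKPOINTS["deberta-v3-small"],
    max_length: int = 512,  # assumed truncation length
) -> List[float]:
    """Return P(AI-generated) for each essay.

    Note: a checkpoint loaded this way has a freshly initialized
    classification head, so the scores are meaningless until the model
    has been fine-tuned on labeled human/AI essays.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=2
    )
    model.eval()

    batch = tokenizer(
        texts,
        truncation=True,
        padding=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**batch).logits
    # Column 1 holds the probability of the assumed "ai" class.
    return torch.softmax(logits, dim=-1)[:, 1].tolist()
```

After fine-tuning, `probs_to_labels(score_essays(essays))` would give one predicted class per essay. Essays longer than `max_length` tokens are truncated in this sketch, so long essays are judged on their opening portion only.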


Simple illustration of the project