Dataset: IMDb Dataset
The IMDb movie review dataset is a set of 50,000 curated reviews from the IMDb database that can be used for research and experimentation on NLP tasks. It is available at: IMDb Movie Reviews Dataset.
To begin, download the tar.gz file from the website and save it in your desired directory. Then use the following Python code to extract the archive and load the reviews into a dataframe:
import glob
import tarfile

import pandas as pd

# Path to the downloaded tar.gz file
tar_file_path = 'path/to/your/tar.gz'

# Extract the tar.gz file into the current working directory
with tarfile.open(tar_file_path, 'r:gz') as tar:
    tar.extractall()

# Read the extracted files into a dataframe
data = {'review': [], 'sentiment': []}
for split in ['train', 'test']:
    for sentiment in ['pos', 'neg']:
        folder_path = f'aclImdb/{split}/{sentiment}'
        file_paths = glob.glob(f'{folder_path}/*.txt')
        for file_path in file_paths:
            # Read as UTF-8 so the result does not depend on the platform's default encoding
            with open(file_path, 'r', encoding='utf-8') as file:
                review = file.read()
            data['review'].append(review)
            # Label positive reviews 1 and negative reviews 0
            data['sentiment'].append(1 if sentiment == 'pos' else 0)

reviews = pd.DataFrame(data)

# Shuffle the rows
reviews = reviews.sample(frac=1).reset_index(drop=True)
reviews.head()
| | review | sentiment |
|---|---|---|
| 0 | I'm grading this film on a curve, in other wor... | 1 |
| 1 | I've always liked Sean Connery, but as James B... | 1 |
| 2 | I taped this on Sundance and had no idea that ... | 1 |
| 3 | I love this movie. It's wacky, funny, violent,... | 1 |
| 4 | Edmund Burke said that "all evil needs is for ... | 1 |
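The loading loop above merges the train and test folders into one dataframe, so the original split is lost once the rows are shuffled. The following is a minimal sketch of a variant that keeps it, assuming the same `aclImdb/{split}/{sentiment}` layout used above. The function name `load_reviews` and the extra `split` column are hypothetical names introduced here for illustration; they are not part of the dataset or of the code above.

```python
from pathlib import Path

import pandas as pd


def load_reviews(root="aclImdb"):
    """Collect reviews under root/{train,test}/{pos,neg} into a DataFrame.

    `load_reviews` and the `split` column are illustrative names (assumptions),
    not part of the original walkthrough.
    """
    rows = []
    for split in ("train", "test"):
        for sentiment in ("pos", "neg"):
            folder = Path(root) / split / sentiment
            # sorted() gives a deterministic row order before any shuffling
            for file_path in sorted(folder.glob("*.txt")):
                rows.append({
                    "review": file_path.read_text(encoding="utf-8"),
                    "sentiment": 1 if sentiment == "pos" else 0,
                    "split": split,  # remember which folder the review came from
                })
    return pd.DataFrame(rows, columns=["review", "sentiment", "split"])
```

After extraction, calling `reviews = load_reviews("aclImdb")` and then `reviews.groupby(["split", "sentiment"]).size()` gives a quick check of how many reviews were loaded for each split and label.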