Data Analysis — Find a good book using Goodreads data and Python

7 min read · Apr 30, 2022

As a bookworm, finding a great book is like finding a treasure. The best place to find great books for me is Goodreads (https://www.goodreads.com/) since it contains recommendations from readers around the world.

As a data analyst, I would like to dig into Goodreads data to answer some interesting questions about the platform.

I will use Python and a library called dtale, which I find to be a fast tool for EDA (Exploratory Data Analysis).

You can find the dataset I used here: https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks

This article was inspired by “Recommending Goodreads Books using Data Mining”.

Let’s start!!

Data Cleaning

If you load the Kaggle dataset with the pandas library, you will run into a problem while reading the CSV file. In a few rows the value in the “authors” column contains a comma (for example, when a book lists more than one author), and pandas, which splits columns on commas, treats it as an extra column boundary.

I solved this issue in Excel. This is not best practice: it is better to write the cleaning code in Python, because Excel cannot open very large datasets, but Excel was faster for me here.

First, add filters (Ctrl+Shift+L).
Then filter column M. If the data were correct, this column would be empty, but you will see 4 rows with values.

That is because the authors value was split into 2 columns in those rows: as you can see, the values in the average rating column are not numbers.
I fixed them by appending the values from the average rating column to the authors column, separated by a semicolon (;), so that pandas no longer splits the value on a comma. Then I moved the values in the remaining columns back to their proper places.
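
The same repair can also be sketched in Python. This is only a minimal sketch, not the Excel procedure described above: it assumes that every over-long row is caused by an extra comma inside the authors column, and the function name `repair_rows` and the file name `books.csv` are placeholders I made up for illustration.

```python
import csv

import pandas as pd


def repair_rows(lines, author_col="authors", joiner=";"):
    """Rebuild rows that have too many fields because of a comma in the authors column.

    Assumption: any extra field in a row comes from the authors column.
    """
    reader = csv.reader(lines)
    header = [name.strip() for name in next(reader)]
    idx = header.index(author_col)
    fixed = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        extra = len(row) - len(header)
        if extra > 0:
            # Glue the split author pieces back together with a semicolon.
            pieces = row[idx: idx + extra + 1]
            merged = joiner.join(piece.strip() for piece in pieces)
            row = row[:idx] + [merged] + row[idx + extra + 1:]
        fixed.append(row)
    # Every value is still a string here; convert numeric columns afterwards.
    return pd.DataFrame(fixed, columns=header)


# Hypothetical usage with the Kaggle file saved as books.csv:
# with open("books.csv", newline="", encoding="utf-8") as f:
#     df = repair_rows(f)
# df["average_rating"] = pd.to_numeric(df["average_rating"])
```

Joining with a semicolon mirrors the manual fix, so the repaired file can be saved and read again without the extra column appearing.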

Data Overview

📝 Columns description

bookID: A unique identification number for each book.

title: The name under which the book was published.

authors: Names of the authors of the book. Multiple authors are delimited with /.

average_rating: The average of all ratings the book has received.

isbn: Another unique number to identify the book, the International Standard Book Number.

isbn13: A 13-digit ISBN to identify the book, instead of the standard 10-digit ISBN.

language_code: The primary language of the book. For instance, eng stands for English.

num_pages: Number of pages the book contains.

ratings_count: Total number of ratings the book received.

text_reviews_count: Total number of written text reviews the book received.

The dataset has 11,127 rows and 12 columns.
There are no missing values.
isbn and isbn13 are not useful in this analysis.
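
These overview checks can be reproduced with plain pandas. The snippet is a minimal sketch under the assumption that the cleaned data is already in a DataFrame; the helper name `overview` is my own and is not part of dtale.

```python
import pandas as pd


def overview(df: pd.DataFrame, drop_cols=("isbn", "isbn13")):
    """Return the shape, the total number of missing cells, and a trimmed copy."""
    shape = df.shape
    missing = int(df.isna().sum().sum())
    # errors="ignore" keeps this safe if a column has already been dropped.
    trimmed = df.drop(columns=list(drop_cols), errors="ignore")
    return shape, missing, trimmed
```

On the cleaned dataset, the shape and missing-value count should match the figures listed above.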

Questions I would like to answer

For all the questions below, I use only dtale, a visualization library for EDA (Exploratory Data Analysis). With this quick tool, 10 interesting questions can be answered just by plotting graphs.

1. What rating should be considered high?

I drew a box plot and a histogram to illustrate the distribution of book ratings (one graph would be enough to understand the distribution, but two make it easier to explain).
Summary statistics are shown along with the plots.
As the box plot shows, the middle half of the ratings falls between about 3.75 and 4.1 (the box spans Q1 to Q3). The histogram also shows high frequencies at the 4 and 4.2 bars.
I will consider a rating over 4.14, the Q3 of the rating data, as high: a book rated above 4.14 outranks 75% of all books, so it should be considered an A-grade book.
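
The quartile-based threshold can be computed directly in pandas. This is a minimal sketch, assuming a column named `average_rating` as in the column description; the function names are placeholders of mine.

```python
import pandas as pd


def rating_quartiles(df: pd.DataFrame, col="average_rating"):
    """Return Q1, the median, and Q3 of the rating column."""
    q1, median, q3 = df[col].quantile([0.25, 0.5, 0.75])
    return q1, median, q3


def high_rated(df: pd.DataFrame, col="average_rating"):
    """Keep only the books rated strictly above Q3."""
    _, _, q3 = rating_quartiles(df, col)
    return df[df[col] > q3]
```

Using Q3 as the cutoff means that roughly a quarter of the books end up labelled as highly rated, whatever the actual scale of the ratings.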

2. How many pages are there in most books?

From these plots, it is clear that there is an outlier book with a huge number of pages. This outlier makes the histogram strongly right-skewed: nearly all the data is squeezed into the left side of the plot.
From the summary statistics, the maximum number of pages is 6,576 (in case you are curious, it is “The Complete Aubrey/Maturin Novels (5 Volumes)”).
It is better to exclude books with more than 2,000 pages before drawing the graphs again. Those books are a small minority, yet they stretch the axis so that they take up around 75% of the histogram area and compress the rest of the data into the left side of the graph.
Anyway, I can answer the question using only the summary statistics.
192–416 pages is common. That is the range from Q1 to Q3, which includes half of the books. This finding aligns with everyday experience of book lengths.
There is one caveat to keep in mind. Books listed on Goodreads include box sets, as in the case of the 6,576-page book, and the page count for a set is the total across all its volumes. As a result, the answer may be overestimated (the actual Q3 for single books is probably less than 416 pages).
The 10 books with the most pages are listed below.
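
The page statistics, the 2,000-page cutoff, and the list of longest books can be sketched in pandas as follows. It assumes the columns are named `title` and `num_pages` with no stray whitespace in the header; `page_stats` is a name I chose for illustration.

```python
import pandas as pd


def page_stats(df: pd.DataFrame, col="num_pages", max_pages=2000, top_n=10):
    """Return (Q1, Q3), the rows at or below the cutoff, and the longest books."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    # Drop the extreme outliers before re-plotting the histogram.
    trimmed = df[df[col] <= max_pages]
    longest = df.nlargest(top_n, col)[["title", col]]
    return (q1, q3), trimmed, longest
```

The quartiles are taken on the full data, while `trimmed` is only meant for plotting, so the outliers do not change the reported Q1 to Q3 range.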

3. What are the top 10 highest-rated books?

Many books have a rating of 5 out of 5, but they were rated by very few users, so the score might not reflect their actual quality.
It is better to keep only books with more than 10,000 ratings to make sure the rating is trustworthy.

After keeping only books with more than 10,000 ratings, many of the top-rated titles are famous ones, such as Harry Potter and The Lord of the Rings.
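
The filter-then-rank step can be written in a few lines of pandas. This is a minimal sketch, assuming the columns `title`, `average_rating`, and `ratings_count` from the column description; `top_rated` is a hypothetical helper name.

```python
import pandas as pd


def top_rated(df: pd.DataFrame, min_ratings=10_000, top_n=10):
    """Rank books by average rating, ignoring those with too few ratings."""
    popular = df[df["ratings_count"] > min_ratings]
    cols = ["title", "average_rating", "ratings_count"]
    return popular.nlargest(top_n, "average_rating")[cols]
```

The 10,000 threshold is a judgment call; raising or lowering `min_ratings` shows how sensitive the top 10 is to that choice.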

4. Which 10 books have received the most ratings?

Most of them have movie adaptations, so users may have read these books after watching the movies.

5. Which words are used the most in book titles?

The 30 most common words include 1) articles, conjunctions, and prepositions: the, of, and, a, in, to, for, on 2) numbers, which indicate the book volume 3) other words: I, life, stories, guide, history, world, love
The last group (other words) is the most interesting to me. It hints at the themes that many books are about, such as life, love, and history.
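
A word count like this can be approximated with the standard library. The sketch below is not how dtale computes it: the tokenization rule (lowercase runs of letters, digits, and apostrophes) is my own assumption, so the exact counts may differ slightly.

```python
import re
from collections import Counter


def common_title_words(titles, top_n=30):
    """Count lowercase word tokens across all titles and return the most frequent."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z0-9']+", title.lower()))
    return counts.most_common(top_n)


# Hypothetical usage with a DataFrame that has a "title" column:
# common_title_words(df["title"])
```

Filtering out articles, conjunctions, and prepositions afterwards leaves the third group of words, which is the part that says something about themes.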

6. Which authors write the most books?

Answer: The authors with the most books (in this dataset) are P.G. Wodehouse, Stephen King, Rumiko Takahashi (a Japanese manga artist), Orson Scott Card, and Agatha Christie, in that order.
All of them are famous, and their books have been adapted into movies or series many times.
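
Counting books per author is a one-liner in pandas. This is a minimal sketch, assuming the column is named `authors`; the optional co-author split assumes that `/` separates the names, which you should check against your copy of the data.

```python
import pandas as pd


def books_per_author(df: pd.DataFrame, top_n=5, split_coauthors=False, sep="/"):
    """Count rows per authors value, optionally crediting each co-author separately."""
    authors = df["authors"]
    if split_coauthors:
        # Assumption: co-authors are separated by `sep` inside one cell.
        authors = authors.str.split(sep).explode().str.strip()
    return authors.value_counts().head(top_n)
```

With `split_coauthors=False`, a book with several authors is counted under the combined string, which is how a plain count on the raw column behaves.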

7. Which 10 authors have received the most ratings?

The author with the most ratings is J.K. Rowling, the writer of Harry Potter. She has almost twice as many ratings as the author in second place.
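
The total number of ratings per author can be obtained by grouping and summing. A minimal sketch, assuming the columns `authors` and `ratings_count`; `ratings_per_author` is a placeholder name.

```python
import pandas as pd


def ratings_per_author(df: pd.DataFrame, top_n=10):
    """Sum ratings_count over all books that share the same authors value."""
    totals = df.groupby("authors")["ratings_count"].sum()
    return totals.nlargest(top_n)
```

Dividing the first value by the second gives the ratio between the most-rated author and the runner-up.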

8. What is the language of most books?

Most books are written in English.
Since Goodreads is an English-language site, it is not surprising that most books are in English as well.

9. Which publishers publish the most books?

Vintage and Penguin take the lead here.

10. Is there any correlation between the number of pages, the number of ratings, or the number of text reviews and the average rating?

In these 3 scatter plots, the ratings cluster around 4, as we have seen in the box plot and histogram.
There is no obvious correlation between the average rating and the number of ratings or the number of text reviews.
Books with many pages (more than 2,000) seem to have high ratings. One possible reason is that publishing such books is costly, so they are screened more carefully than shorter books.
However, the ratings of books with fewer than 2,000 pages are spread between about 3.5 and 4.5 and are difficult to predict.

Pearson correlation aligns with the results from the scatter plots. The correlation between average rating and other parameters is low (almost zero) for all parameters.
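
The Pearson coefficients can be checked with pandas as well. This is a minimal sketch, assuming the four numeric columns are named as in the column description and are already stored as numbers; `rating_correlations` is a name I chose.

```python
import pandas as pd


def rating_correlations(df: pd.DataFrame):
    """Pearson correlation of average_rating with each of the other numeric columns."""
    cols = ["average_rating", "num_pages", "ratings_count", "text_reviews_count"]
    corr = df[cols].corr(method="pearson")
    # Keep only the average_rating column and drop its trivial self-correlation.
    return corr["average_rating"].drop("average_rating")
```

Pearson correlation only measures linear relationships, so a value near zero does not rule out other patterns, such as the behaviour of very long books noted above.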

You can find my code and cleaned dataset here: https://github.com/Yannawut/Goodreads_Analysis

This is my first article about data analysis. If you have any comments or suggestions, feel free to tell me.
