Hi, I recently scraped all of the TMDb data available through the API. I ran into a problem while trying to create a unique index on the imdb_id
field in my local MongoDB server. It turns out that, firstly, IMDb ids like "ttqwerty1" exist, which isn't a valid id. An example can be seen here:
https://www.themoviedb.org/movie/466964-krish/edit?active_nav_item=external_ids
In the database, I'm sure you'll find 3 entries when you run
db.movies.find({"imdb_id": "ttqwerty1"}).
Another example is: "tt7843749"
Secondly, many of the entries share the same IMDb id. I know that in your database you've set TMDb ids to be unique, but shouldn't the expected behaviour be to check whether an IMDb id has already been assigned to another entry in your database, rather than directly accepting the value?
I propose 2 checks. The first is a regex check, done on the server against /tt\d{7}/,
to confirm that the IMDb id submitted by the client is valid. The second is to check whether that IMDb id already belongs to some other TMDb id. I'm sure this could be applied to the other fields under "External IDs" for each movie as well.
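To make the two checks concrete, here is a rough sketch in Python. It is only an illustration, not how TMDb does or should implement it. The function names and the `existing` dict (standing in for a database lookup of IMDb id to TMDb id) are made up for the example, and the 7-digit length simply follows the regex proposed above. The pattern is applied with `fullmatch` so that ids with extra characters before or after are rejected too.

```python
import re

# Assumption: a valid IMDb title id is "tt" followed by exactly 7 digits,
# as in the regex proposed above.
IMDB_ID_RE = re.compile(r"tt[0-9]{7}")


def is_valid_imdb_id(imdb_id):
    """Check 1: the whole string must match the pattern."""
    return isinstance(imdb_id, str) and IMDB_ID_RE.fullmatch(imdb_id) is not None


def find_conflict(imdb_id, tmdb_id, existing):
    """Check 2: return the TMDb id that already owns imdb_id, or None.

    `existing` is a hypothetical stand-in for the database: a dict that
    maps each stored IMDb id to the TMDb id it is attached to.
    """
    owner = existing.get(imdb_id)
    if owner is not None and owner != tmdb_id:
        return owner
    return None


def check_external_id(imdb_id, tmdb_id, existing):
    """Run both checks and return a short status string."""
    if not is_valid_imdb_id(imdb_id):
        return "invalid format"
    owner = find_conflict(imdb_id, tmdb_id, existing)
    if owner is not None:
        return f"already assigned to TMDb id {owner}"
    return "ok"
```

With this, `check_external_id("ttqwerty1", 1, {})` is rejected on format alone, and an id that is well formed but already stored under a different TMDb id is reported as a conflict instead of being accepted.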
I think these checks will add more legitimacy to the external IDs for each movie.
Please let me know what you have to say about this.
Thanks, John
Reply by Travis Bell
on September 5, 2017 at 10:10 AM
Hi John,
Indeed, IMDb ids are unique within each namespace. But they are not enforced to be globally unique.
I only see one of those records now.
With regards to extra validation, yup, I will at some point get to that. I already have an open ticket for that here.
Reply by john1jacob
on September 5, 2017 at 11:37 AM
Ah! I'm sorry, I didn't verify my data integrity. It turns out I dumped a few rows multiple times. It seems they are all unique.
I hope the test for imdb id will be added though! Thanks.
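For anyone repeating this check on a local dump, a minimal sketch of how to tell repeated rows apart from IMDb ids that are genuinely shared. It assumes each scraped row is a dict with the TMDb id under "id" and the IMDb id under "imdb_id"; the function name is made up for the example.

```python
from collections import defaultdict


def shared_imdb_ids(rows):
    """Map each IMDb id to the distinct TMDb ids that use it.

    Rows dumped more than once carry the same TMDb id, so they collapse
    into a single entry and are not reported as a conflict.
    """
    owners = defaultdict(set)
    for row in rows:
        imdb_id = row.get("imdb_id")
        if imdb_id:  # skip rows with a missing or empty IMDb id
            owners[imdb_id].add(row["id"])
    # Keep only IMDb ids attached to more than one distinct TMDb id.
    return {imdb_id: sorted(ids) for imdb_id, ids in owners.items() if len(ids) > 1}
```

An empty result means every repeated IMDb id in the dump came from the same movie being stored more than once, which is what happened in my case.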