Hello there,
I'm new to TMDB. I'm building an app as a project in college and I need a movie database.
Is there a way to download the entire dataset for local use? Is that allowed? If not, is the only option to use the TMDB API for every call my app makes? Is there a way to get just the ID of each item in the database?
If my application is to become commercial one day, will it be possible to get more than 30 requests per 10 seconds?
Thanks
Reply by mateinone
on January 2, 2017 at 12:22 PM
I think that is probably the best option, Travis, when you get time. If there is a list of invalid IDs (and, if at all possible, adult IDs), it allows people to make fewer calls, so less stress on the resources here. I know I cycled through a second time to validate data.
I mention the adult stuff as I suspect a large portion of people would want to skip pulling that down as well.
But I agree with everything Adi said as well, and I don't think the site should just have a full DB dump ready to pull down.
Reply by Adi
on January 2, 2017 at 1:34 PM
I grabbed all the adult stuff, as I don't trust any data ever. In the end, I actually sorted out all the adult data, so there wasn't stuff in there which shouldn't have been. Then comes the debate as to what constitutes an adult film, or whether adult is the appropriate word to be using! :P I think most of us keep a table of 404 IDs. I can provide one, but mine is only up to date to about April at the moment.
I tend to prefer not getting the latest X amount of additions as I prefer to wait for the curators to go around, remove the duplicates, fix stuff, add the essentials etc.
We live in a world where everything has to be the latest, the most complete etc. Yet when you actually start working with the stuff, you find that 75% of it will never get used and you value quality over quantity. (This is coming from someone who is very OCD over completeness and accuracy.) I stopped trying to fix things, other than fixing them on TMDB itself, as given enough time, it will get fixed by someone anyhow. IMDB has over 1 million entries, yet I think over 700,000 of those are short films. Do you really want those? I should check to see how many have over 10 reviews :P
The way I originally constructed my DB is very different to how I have it now. My approach to updating it has also changed a lot over the years.
I think people fail to appreciate just how much data there is. The amount of images is just eye watering!
I bet Travis dreams of how he would do it all differently if he could start from scratch! It is the developers dream!
Reply by Travis Bell
on January 2, 2017 at 1:39 PM
Me too man!
Oh man, truer words have never been spoken!
I posted this 2016 feature last week, and indeed, we added over 477,000 images last year alone!
Reply by Denny
on January 3, 2017 at 9:02 AM
Well, of course it is work to extract a dump and put it online.
On the other hand, I pretty much wouldn't care about the way the data is set up, as I would get the data as it is and there would be no need to think about the design of my database. That is where I see the inconvenience. Don't get me wrong: I don't want to nag about the way the data is provided now. I am even thankful for the chance to have a nice data source, and thank you for the work that you have put into this for years now. But I just want you to understand my point. Firstly, a dump isn't that hard to extract, as it is even automatable like a charm. Secondly, you don't have to use your own bandwidth. There are thousands of services available for uploading large files. They may be ad-supported, but would those who are interested really care? Well, I wouldn't. So what I don't get is that @Tokubetsu keeps whitewashing this inconvenience, since even he has his issues with the data...
Reply by Adi
on January 3, 2017 at 6:53 PM
I think it is more an understanding of how the world works.
So you do a data dump and upload it to a file storage site.
First off, you get those who complain about it not being in the format that they want it in.
You get those who complain about the flat format it is presented in, just like we do with the IMDB dumps.
Then you get those who complain because you uploaded it to a site they aren't a fan of using.
After that, you have those who complain that you are obviously a moron, as no DB structure should ever be done like that and you should do it like.....
That is a whole lot of issues and threads and all sorts to deal with.
Or you could just not do it and then you just have this one thread, which is a lot easier to deal with.
Simply put, data dumps create a whole can of worms, which if you are ready for it, fine, but if you are fighting to keep your head above water, it just isn't worth the grief.
With this sort of thing, you try and do work which reduces issues or adds improvements for the masses. Working on something which creates more issues than it solves and only helps the few, generally ends up being way down on the priority list.
In an age when many film sites are revoking API access to the public, let alone data dumps, I am grateful for the access to an API, let alone anything else.
When it comes to development, the coding and implementation is often the easy part. It is everything which surrounds it which is tricky and takes time and can often be the reason not to do it at all. I think that is the bit which needs to be understood. It isn't just about your use case, it is about everyone's use case. Just because you may not complain, that doesn't mean lots of other people won't, because it isn't quite right for what they want.
There are things on TMDB where I don't understand why it doesn't get changed, it would take seconds to fix, programming wise. Does it get fixed? No. Why? Reasons. Is this the same for my site? Hell yes. That is the nature of the beast.
Reply by Denny
on January 4, 2017 at 6:50 AM
Thanks @Tokubetsu for your statements and clarification. I get the idea. I think I am a bit desperate about how to build up a good structure in the remaining time. The clock is ticking and I am struggling to design the database right in order to fill it via the provided API. If there is someone willing to help, I would appreciate it!
Reply by Adi
on January 5, 2017 at 12:35 AM
My advice to you would be to set up a script to start calling the API and saving the responses out as JSON files. I don't do it particularly efficiently (I don't try to push right up against the rate cap), and I think it takes me around 2 weeks to get everything.
Do I do it directly into a DB? No. I save everything as JSON files. That way, if something goes wrong or I create my DB structure wrong, it doesn't matter; I can just start again and re-inhale the JSON files.
This also allows me to check to see what the maximum field length is etc. before sticking it into the DB. Hell, you can keep track of those things as you download them. Since you have that bit of time between each call, you might as well do something with it.
Whilst it is all downloading, you can look at the data and start playing with how you want to structure it. If it is purely for your own use, then it shouldn't be very difficult. Just index it up and you should be good. Structure matters more when you are thinking about updating going forward and how well it scales when people are using it.
Also, remember to make use of the Append element of the API. It makes your life easier and hammers TMDB less.
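The workflow described above (call the API at a throttled rate, write each raw response to a JSON file, log 404s for later) can be sketched roughly like this in Python. The endpoint shape and `append_to_response` parameter are from the v3 API; the API key, append list, output directory, and delay value are placeholders you would adjust:

```python
import time
import urllib.error
import urllib.request
from pathlib import Path

BASE = "https://api.themoviedb.org/3/movie/{}"
# Example append list; pick the sub-resources your own app actually needs.
APPEND = "credits,keywords,images,videos,release_dates"

def build_movie_url(movie_id, api_key):
    """Build a v3 movie URL, bundling sub-requests via append_to_response."""
    return f"{BASE.format(movie_id)}?api_key={api_key}&append_to_response={APPEND}"

def archive_movies(ids, api_key, out_dir="tmdb_json", delay=0.4):
    """Fetch each movie and save the raw JSON to disk before any DB import,
    logging 404 IDs so a re-run can skip them."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for movie_id in ids:
        try:
            with urllib.request.urlopen(build_movie_url(movie_id, api_key)) as resp:
                (out / f"{movie_id}.json").write_text(resp.read().decode("utf-8"))
        except urllib.error.HTTPError as err:
            if err.code == 404:  # remember invalid IDs for later runs
                with open(out / "404_ids.txt", "a") as log:
                    log.write(f"{movie_id}\n")
        time.sleep(delay)  # throttle to stay under the documented request limit
```

Re-inhaling the saved files into whatever schema you settle on is then a purely local operation, so a bad first attempt at the table design costs nothing.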
Reply by mateinone
on January 5, 2017 at 5:27 AM
I get my database refreshed in about 2 days using between 30 and 40 transactions per 10 seconds. As the limit is per IP, not per API key, you could run 'batches' on another machine, combine them, and really get it done in half the time.
As per the above, append is your friend. The data I gather is contained in the following, but each person has their own needs and I just include this as an example.
https://api.themoviedb.org/3/movie/$movdb?api_key=$api&language=en&append_to_response=images,alternative_titles,videos,credits,keywords,release_dates,similar_movies&include_image_language=en
If you are in a rush, multithreading within your program, or just running the same program multiple times in parallel (whilst load balancing), will let you pull it all down fairly quickly.
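A minimal sketch of that batching idea, assuming you already have a single-item fetch function (the names `split_batches`, `fetch_all`, and `fetch_one` here are hypothetical): round-robin the ID list into per-worker batches, then run each batch on its own thread. Each worker still has to throttle itself if all the batches share one IP:

```python
from concurrent.futures import ThreadPoolExecutor

def split_batches(ids, workers):
    """Round-robin split of an ID list into one batch per worker/machine."""
    return [ids[i::workers] for i in range(workers)]

def fetch_all(ids, fetch_one, workers=4):
    """Run fetch_one over each batch in its own thread and combine results.
    fetch_one should rate-limit itself when the workers share an IP."""
    batches = split_batches(ids, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda batch: [fetch_one(i) for i in batch], batches)
    return [item for batch in results for item in batch]
```

Handing one batch to each machine instead of each thread gives the multi-machine variant described above; combining the output directories afterwards is then a simple file merge.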
Reply by Bruno Carvalhal
on February 2, 2017 at 10:53 AM
If anyone would like to download TMDB entries in bulk, I created a small C# application that reads search keywords from a .csv file and returns JSON files.
https://github.com/ExtraBB/bulk-tmdb
It takes a while, but adheres to the request rate limit and does the job.
Reply by Brisse
on April 6, 2017 at 9:34 PM
A list of invalid IDs:
https://gist.github.com/bradrisse/06a3ab546eae5b7344aee0a2aed561c5
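For anyone iterating over the full ID range, a list like this is only useful if it is cheap to consult. A small sketch (assuming a file format of one numeric ID per line) that loads it into a set and filters the work queue before any calls are made:

```python
def load_invalid_ids(path):
    """Read one invalid TMDB ID per line into a set for O(1) membership tests."""
    with open(path) as fh:
        return {int(line) for line in fh if line.strip()}

def ids_to_fetch(candidate_ids, invalid):
    """Drop known-404 IDs so they never cost an API call."""
    return [i for i in candidate_ids if i not in invalid]
```

Since a community-maintained list may contain false positives, it is safer to treat it as a skip-hint rather than ground truth, and to spot-check a random sample against the API.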
Reply by Cupidvogel
on April 14, 2017 at 12:51 PM
The list of invalid IDs is wrong: 2 and 3 are perfectly valid IDs, but they are included.
Reply by Adi
on April 14, 2017 at 1:29 PM
Always left me wondering whether Ariel (2) was a favourite of Travis's! And why didn't 1 make it?
Reply by Brisse
on April 14, 2017 at 1:43 PM
Thanks, I'll fix those and run through the list again to double-check each one. Based on random testing, I believe the list is ~90% accurate. For me, this list saves about 25% of the time spent iterating.
Reply by Travis Bell
on April 14, 2017 at 1:49 PM
I'll be publishing an official list soon, but thanks for your contributions so far!
Reply by Brisse
on April 14, 2017 at 1:57 PM
Sounds good, I'll wait for the list instead of updating.