IMDb Data Visualization


Project Datasets


The IMDB 5000 Movie Dataset is the main dataset in my project. It has already been processed and spans 28 variables of different types (numerical, categorical, etc.), including title, number of critic reviews, number of Facebook likes, number of faces in the movie poster, plot keywords, director, actors, budget, IMDB score, and more. These variables are useful and support data visualizations that are informative and interesting to users. The original data source link can be found here. This dataset is used for my Circle packing, Sankey diagram, Scatterplot, Bar Chart, Network graph and Geomap visualizations. Although the dataset gives the country where each movie was released, it provides no geo-location. To implement the geomap, I joined it with the dataset that Mike Bostock used in his World Map example; the data source can be downloaded here. Since I'm implementing a Network graph in this project, I think it is important to include actor and director biographies, posters, and a short overview of each movie. The IMDb dataset doesn't have these data, so I decided to use the API provided by The Movie Database to obtain the short movie overviews, the biographies, and the poster images for movies, actors and directors.
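The country join described above can be illustrated with a minimal Python sketch. This is not the project's actual script (that was written in R). The sample rows, the column names `movie_title`, `country` and `imdb_score`, and the `COUNTRY_IDS` lookup are all assumptions made for illustration. The sketch also assumes the world map is keyed by ISO 3166-1 numeric country codes.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real file and its column names are assumptions.
MOVIES_CSV = """movie_title,country,imdb_score
Avatar,USA,7.9
Spectre,UK,6.8
The Dark Knight,USA,9.0
Amelie,France,8.3
"""

# Hypothetical lookup from the dataset's country names to ISO 3166-1 numeric
# codes, assumed here to be the ids used by the world map features.
COUNTRY_IDS = {"USA": 840, "UK": 826, "France": 250}


def movies_per_country(csv_text):
    """Count movies per country, skipping rows with an empty country."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        row["country"].strip() for row in reader if row["country"].strip()
    )


def join_with_map(counts, country_ids):
    """Attach a map feature id to each country; report names with no match."""
    joined, unmatched = [], []
    for country, n in sorted(counts.items()):
        if country in country_ids:
            joined.append(
                {"id": country_ids[country], "country": country, "movies": n}
            )
        else:
            unmatched.append(country)
    return joined, unmatched


counts = movies_per_country(MOVIES_CSV)
joined, unmatched = join_with_map(counts, COUNTRY_IDS)
```

Returning the unmatched names separately makes it easy to spot countries whose spelling in the movie data differs from the map data, which is a common source of blank regions on a geomap.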


Data Processing


Most of the data processing is done with an R script, which can be downloaded here. For the Network graph, the data is still processed with the R script but then exported to Gephi to calculate node centrality and degree, and to get a quick look at how my dataset appears in a force-layout network graph. I quickly found that with too many nodes and links, the clusters have too many elements overlapping each other, which is not ideal for data visualization because no pattern is visible. Therefore, I decided to explore only the Top 100 Rated Movie network, which looks like the best fit for my project, where I planned to show the tripartite relationship and network between movies, actors and directors.
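As a rough illustration of the network step, here is a small Python sketch that keeps only the top-rated movies, builds movie, actor and director nodes with links between them, and counts node degree. It is not the R script or the Gephi workflow used in this project. The sample data, the tuple layout, and the function names `build_network` and `degrees` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (title, IMDb score, director, list of actors).
MOVIES = [
    ("Movie A", 9.0, "Director X", ["Actor 1", "Actor 2"]),
    ("Movie B", 8.8, "Director X", ["Actor 2", "Actor 3"]),
    ("Movie C", 8.5, "Director Y", ["Actor 1"]),
    ("Movie D", 6.0, "Director Z", ["Actor 4"]),
]


def build_network(movies, top_n):
    """Keep the top_n movies by score and link each to its director and actors."""
    top = sorted(movies, key=lambda m: m[1], reverse=True)[:top_n]
    nodes = {}  # node name -> node type ("movie", "director" or "actor")
    links = []  # (movie, person) pairs
    for title, _score, director, actors in top:
        nodes[title] = "movie"
        nodes[director] = "director"
        links.append((title, director))
        for actor in actors:
            # Simplification: a person who both acts and directs keeps
            # only the type assigned last.
            nodes[actor] = "actor"
            links.append((title, actor))
    return nodes, links


def degrees(links):
    """Count how many links touch each node."""
    deg = defaultdict(int)
    for source, target in links:
        deg[source] += 1
        deg[target] += 1
    return dict(deg)


nodes, links = build_network(MOVIES, top_n=3)
deg = degrees(links)
```

With `top_n=100` on the full dataset, the same filtering keeps the graph small enough for a force layout to stay readable. Degree is the simplest centrality measure; other measures that Gephi offers are not reproduced here.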


Motivation


My motivation for picking this dataset is that it provides rich information and enough variables to build many good data visualizations. It is also interesting to look at how an IMDb score is affected by different factors, as discussed in the data visualization part of this project. Furthermore, this dataset gives users a good overview of the movie industry: how Hollywood stars and directors are connected to each other, which movie genres are most profitable in a specific period of time, and other analyses such as a movie's gross earnings, profit and return on investment. These provide good insight for investors or people who would like to invest in the movie industry. Finally, this dataset has given me a good opportunity to learn different types of diagrams that present useful information, such as the Sankey diagram and the network graph.


License


The IMDb dataset used in this data visualization project contains the IMDb 5000 movies, which are made available under the Open Database License (ODbL) v1.0.

As stated in Section 4.3 of the license:
4.3 Notice for using output (Contents). Creating and Using a Produced Work does not require the notice in Section 4.2. However, if you Publicly Use a Produced Work, You must include a notice associated with the Produced Work reasonably calculated to make any Person that uses, views, accesses, interacts with, or is otherwise exposed to the Produced Work aware that Content was obtained from the Database, Derivative Database, or the Database as part of a Collective Database, and that it is available under this License. a. Example notice. The following text will satisfy notice under Section 4.3: Contains information from DATABASE NAME, which is made available here under the Open Database License (ODbL). DATABASE NAME should be replaced with the name of the Database and a hyperlink to the URI of the Database. “Open Database License” should contain a hyperlink to the URI of the text of this License. If hyperlinks are not possible, You should include the plain text of the required URI’s with the above notice.
We're allowed to use the dataset as long as we follow the notice requirement above.