
Your Project Title Goes Here
A
Mini Project Report Submitted by
Ms. Nikita S. Nale 1841045
Ms. Harshada S. Mane 1841042
In partial fulfillment for the requirement of Laboratory Practice-II of
Ba…elor of Computer Engineering
Under the guidance of
Prof. Padulkar D. M (designation of guide) Department of Computer Engineering
Vidya Pratishthan’s Kamalnayan Bajaj Institute of Engineering and
Technology
Bhigawan Road, Vidyanagari
Baramati-413133
2018-2019


Vidya Pratishthan’s
Kamalnayan Bajaj Institute of Engineering and Technology, Baramati
Department of Computer Engineering
Certificate
This is to certify that the following students
Ms. Nikita S. Nale 1841045
Ms. Harshada S. Mane 1841042
have successfully completed their project work on TITLE OF YOUR PROJECT GOES HERE
during the academic year 2018-2019 in partial fulfillment towards
the completion of Laboratory Practice-II in Computer Engineering.
Project Guide HoD Deptt. of Comp. Engg.
(Prof. Padulkar D. M) (Prof. Mrs. S. S. Nandgaonkar)
Principal
( Dr. R. S. Bichkar)
Internal Examiner External Examiner

Acknowledgments
We are happy to present this project report as a reflection of our sincere effort. We are
pleased to acknowledge Prof. Padulkar D. M for their invaluable guidance during this
project work. We are equally indebted to our principal Dr. R. S. Bichkar for his valuable help
whenever needed.
Ms. Nikita S. Nale
Ms. Harshada S. Mane

Abstract
Predicting a movie’s success is a difficult problem. A movie’s success does not depend
only on its quality; external factors such as competing movies and the time of year also
affect it, because they influence box office sales at the movie’s opening. We introduce a
simple approach for predicting movie success in terms of rating and revenue. The approach
gave reasonably good estimates, allowing theatre planning to a certain extent, even for
small studios. Because the prediction of movie success is of great importance to the
industry, this project presents a detailed study of Logistic Regression, Naive Bayes and
K-Nearest Neighbours applied to movie data to predict the movie success rate.

Contents
Acknowledgments
Abstract
List of Tables
List of Figures
1 Introduction
1.1 Overview
1.2 Brief Description
1.3 Problem Definition
2 Literature Survey
3 Dataset Description
3.1 Introduction
3.1.1 Purpose
3.1.2 Project Scope
3.1.3 Assumptions and Dependencies
4 Data Preprocessing and visualization
4.1 Steps in Data Preprocessing
4.2 Visualization
5 Classification
5.1 Logistic Regression
5.2 Naive Bayes Classifier
5.3 K-Nearest Neighbours
6 Confusion Matrix
6.1 Analyse Confusion Matrix
6.1.1 Logistic Regression
6.1.2 Naive Bayes Classifier
6.1.3 KNN Classifier
6.2 Compare Classifiers
7 Result Analysis
7.1 Logistic Regression
7.2 Naive Bayes Classifier
7.3 KNN Classifier
8 Conclusion and Future Work
A Glossary
Bibliography
Identifying the Movie Success Rate
VPKBIET, Baramati

List of Tables
6.1 Confusion matrix of Logistic Regression
6.2 Confusion matrix of Naive Bayes Classifier
6.3 Confusion matrix of KNN Classifier

List of Figures
4.1 Movie with year vs rating

1
Introduction
1.1 Overview
The film industry is one of the largest sources of entertainment in the world. The industry
produces thousands of films annually and earns billions of dollars in revenue. Large
production houses control the film industry, spending billions of dollars on the promotion
of movies, and advertising contributes heavily to a movie’s total budget. When a movie
fails, the producer suffers a heavy loss on the investment. If it were possible to know the
success rate of movies in advance, production houses could adjust the release of their
movies to gain maximum profit by predicting when the market is favourable and when it is
not. Predicting the box office opening of movies is therefore very important for increasing
the success rate of a movie.
1.2 Brief Description
The objective of our project is to predict the success rate of a movie based on attributes
such as the actors involved, the director, the year of release, movie genre, total runtime,
user rating, number of votes, total revenue generated by the movie, the overall metascore,
the age of the users who watch the movie and record votes or ratings, the geographical
areas where the movie was released, and other influences such as political movements and
ongoing trends. The most difficult task is obtaining a dataset for this kind of prediction
and analysis.
We performed the following steps:
1. Searched for available datasets, then picked and created a suitable dataset.
2. Labelled the attributes.
3. Applied data preprocessing.
4. Used this data as input to the machine learning and data mining algorithms for
prediction of the movie success rate.
5. Split the data into training and testing data.
6. Used the Logistic Regression, K-Nearest Neighbour and Naïve Bayes classifiers.
7. Computed the results of the algorithms by means of the confusion matrix, accuracy,
recall and precision.
8. Analysed the effect of various attributes on the success rate of a movie. These
attributes include rating, votes, actors, directors, revenue and metascore.
1.3 Problem Definition
Identifying the movie success rate based on the ratings and revenues of movies, in order to
prevent losses for production houses and increase their profit.

2
Literature Survey
Darin Im and Minh Thao [1] follow the functional steps of data extraction, data
preprocessing, data integration and transformation, feature selection and finally
classification. They also used a movie dataset, as in [2], and set parameters to classify a
movie as a success or failure based on an algorithm they designed. Although their
implementation showed a high rate of accuracy in prediction, their algorithm has poor time
complexity: the initial data retrieval takes a long time to create a training data set even
for a few tuples of data. We plan to build on their idea by adding our own algorithm that
converts the value of a classifying parameter, such as actor name, revenue of the movie,
rating and so on, to a numerical value. That value is then put into a broader formula
relating all classifying parameters of the test data, which decides whether the movie will
be successful or not.

3
Dataset Description
3.1 Introduction
The movie dataset contains the following attributes:
1) Rank – Rank of the movie
2) Title – Title of the movie
3) Genre – Genre of the movie
4) Description – Description of the movie
5) Director – Director of the movie
6) Actors – Actors of the movie
7) Year – Year of the movie
8) Runtime (Minutes) – Runtime of the movie
9) Rating – Rating of the movie
10) Votes – Votes of the movie
11) Revenue (Millions) – Revenue of the movie
12) Metascore – Metascore of the movie
3.1.1 Purpose
The purpose of our project is to predict the success rate of a movie based on attributes
such as the actors involved, directors, year of release, movie genre, total runtime, user
rating, number of votes, total revenue generated by the movie and the overall metascore.
3.1.2 Project Scope
This project focuses on predicting the movie success rate based on the ratings and revenues
of movies. The approach gave reasonably good estimates, allowing theatre planning to a
certain extent, even for small studios. The prediction of movie success is therefore of
great importance to the industry. If it were possible to know beforehand the likelihood of
success of the movies, production houses could adjust the release of their movies to gain
maximum profit.
3.1.3 Assumptions and Dependencies
The movie success rate at the box office opening is assumed to depend on the rating and
revenue of movies.

4
Data Preprocessing and visualization
4.1 Steps in Data Preprocessing
1. Import the libraries
2. Import the dataset
3. Check for missing values
4. Inspect the categorical values
5. Split the dataset into training and test sets
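The steps above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the four sample rows, the helper names and the 25% test fraction are assumptions made for the example and are not taken from the project's dataset or code.

```python
import csv
import io
import random

# Hypothetical four-row sample; column names follow the dataset description.
SAMPLE_CSV = """Title,Genre,Year,Rating,Revenue (Millions)
Movie A,Action,2014,8.1,333.13
Movie B,Drama,2016,6.2,
Movie C,Action,2015,7.0,45.5
Movie D,Comedy,2016,5.4,12.0
"""

def load_rows(text):
    """Step 2: read the dataset into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def count_missing(rows):
    """Step 3: count empty cells per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                counts[column] = counts.get(column, 0) + 1
    return counts

def encode_categorical(rows, column):
    """Step 4: map each distinct category to an integer code."""
    codes = {name: i for i, name in enumerate(sorted({r[column] for r in rows}))}
    return [codes[r[column]] for r in rows], codes

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Step 5: shuffle the rows and split them into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = load_rows(SAMPLE_CSV)
missing = count_missing(rows)            # columns with empty cells
genre_codes, mapping = encode_categorical(rows, "Genre")
train, test = train_test_split(rows)
print(missing, mapping, len(train), len(test))
```

In practice a data-frame library would usually replace these helpers; the sketch only makes the order of the steps concrete.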
4.2 Visualization
Figure 4.1: Movie with year vs rating

5
Classification
5.1 Logistic Regression
Logistic regression is a statistical method for analyzing a dataset in which one or more
independent variables determine an outcome. The outcome is measured with a dichotomous
variable, i.e. the dependent variable is binary and contains only data coded as 1 or 0. The
binary logistic model is used to estimate the probability of a binary response based on one
or more predictor variables [4].
The goal of logistic regression is to find the best fitting model to describe the
relationship between the dichotomous characteristic of interest and a set of independent
(predictor or explanatory) variables.
Logistic regression equation:
$$\operatorname{logit}(p) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$
Here p is the probability of presence of the characteristic of interest, X_1, ..., X_k are
the predictor variables and b_0, ..., b_k are the model coefficients.
The logistic transformation is defined as the logged odds:
$$\text{odds} = \frac{p}{1-p}, \qquad \operatorname{logit}(p) = \ln\frac{p}{1-p}$$
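As a small numerical illustration of the odds and logit definitions, the following Python sketch evaluates them and inverts the logit with the logistic (sigmoid) function. The function names and the coefficient values used in the example call are hypothetical and are not fitted to the project's data.

```python
import math

def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1.0 - p)

def logit(p):
    """Logged odds: ln(p / (1 - p))."""
    return math.log(odds(p))

def sigmoid(z):
    """Inverse of the logit: maps a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefficients, features):
    """Probability from a linear predictor b0 + b1*x1 + ... + bk*xk."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)

p = 0.8
print(odds(p), logit(p), sigmoid(logit(p)))
# Hypothetical coefficients, for illustration only.
print(predict_probability(-1.0, [0.5, 0.25], [2.0, 4.0]))
```

A probability above 0.5 corresponds to a positive logit, which is why a fitted model with a 0.5 threshold classifies by the sign of the linear predictor.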
5.2 Naive Bayes Classifier
The Naïve Bayes algorithm is a classification technique based on Bayes’ Theorem with
an assumption of independence among predictors. In simple terms, a Naive Bayes classifier
assumes that the presence of a feature in a class is unrelated to the presence of any other
feature. A Naive Bayes model is easy to build and particularly useful for very large data
sets. Despite its simplicity, Naive Bayes can sometimes outperform more sophisticated
classification methods.
Formula:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
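To make the formula concrete, the sketch below applies Bayes' theorem directly and then combines several features under the independence assumption. All probabilities, class names and feature names are invented for illustration; they are not estimates from the movie dataset.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b

def naive_bayes_posteriors(priors, likelihoods, features):
    """Class posteriors assuming features are independent given the class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in features:
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())  # normalising constant, plays the role of P(B)
    return {label: score / total for label, score in scores.items()}

# Hypothetical probabilities, for illustration only.
priors = {"success": 0.25, "failure": 0.75}
likelihoods = {
    "success": {"high_rating": 0.8, "big_budget": 0.5},
    "failure": {"high_rating": 0.2, "big_budget": 0.5},
}
# P(high_rating) = 0.8 * 0.25 + 0.2 * 0.75 = 0.35
print(posterior(0.25, 0.8, 0.35))
result = naive_bayes_posteriors(priors, likelihoods, ["high_rating", "big_budget"])
print(result)
```

Because "big_budget" is equally likely in both hypothetical classes, it does not change the posterior, so both calls give the same probability of success.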
5.3 K-Nearest Neighbours
In the classification setting, the K-nearest neighbour algorithm essentially boils down to
forming a majority vote between the K most similar instances to a given unseen
observation. Similarity is defined according to a distance metric between two data points.
A popular choice is the Euclidean distance, given by
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
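A minimal sketch of K-nearest-neighbour classification with the Euclidean distance is given below. The training points, the labels and the helper names are hypothetical; the points are assumed to be already scaled so that no single feature dominates the distance.

```python
import math
from collections import Counter

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train_points, train_labels, query, k=5):
    """Return the majority label among the k training points nearest to query."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical, already scaled feature vectors; label 1 = success, 0 = failure.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels = [0, 0, 0, 1, 1, 1]
prediction_low = knn_predict(points, labels, (2.0, 2.0), k=5)
prediction_high = knn_predict(points, labels, (8.5, 8.5), k=5)
print(prediction_low, prediction_high)
```

An odd k with two classes avoids tied votes, which is one common reason for choosing a value such as k = 5.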

6
Confusion Matrix
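Table 6.1: Confusion matrix of Logistic Regression
                   Predicted negative   Predicted positive
Actual negative    TN = 69              FP = 6
Actual positive    FN = 8               TP = 14

Table 6.2: Confusion matrix of Naive Bayes Classifier
                   Predicted negative   Predicted positive
Actual negative    TN = 71              FP = 4
Actual positive    FN = 16              TP = 6

Table 6.3: Confusion matrix of KNN Classifier
                   Predicted negative   Predicted positive
Actual negative    TN = 74              FP = 1
Actual positive    FN = 15              TP = 7

6.1 Analyse Confusion Matrix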
6.1.1 Logistic Regression
Accuracy: 85.56%
Precision: 0.7
Recall: 0.636
6.1.2 Naive Bayes Classifier
Accuracy: 79.38%
Precision: 0.6
Recall: 0.272
6.1.3 KNN Classifier
Accuracy: 83.5%
Precision: 0.875
Recall: 0.318
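The reported figures can be checked from the cell counts of Tables 6.1 to 6.3. The sketch below assumes each table is read row by row as [[TN, FP], [FN, TP]], with the second class taken as positive. This reading is an assumption, but it reproduces the reported accuracy, precision and recall up to rounding. The function name is illustrative.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Cell counts from Tables 6.1 to 6.3, read as (TN, FP, FN, TP) under the
# assumption stated above.
matrices = {
    "Logistic Regression": (69, 6, 8, 14),
    "Naive Bayes": (71, 4, 16, 6),
    "KNN": (74, 1, 15, 7),
}
results = {name: classification_metrics(*cells) for name, cells in matrices.items()}
for name, m in results.items():
    print(f"{name}: accuracy={m['accuracy']:.4f} "
          f"precision={m['precision']:.3f} recall={m['recall']:.3f}")
```

All three matrices sum to 97 test samples, with 75 in the first class and 22 in the second, so the classifiers were evaluated on the same test set.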
6.2 Compare Classifiers
The accuracy of all three models was similar; however, the Logistic Regression (85.56%)
and K-Nearest Neighbours (83.5%) models had the highest accuracy in our case for
predicting movie success.

7
Result Analysis
7.1 Logistic Regression
When binary values are considered as input, the Logistic Regression classifier has a good
accuracy of 85.56%. The accuracy is quite high, and the algorithm is very stable when the
dataset has more than one independent variable.
7.2 Naive Bayes Classifier
The accuracy of the Naive Bayes classifier is 79.38%.
7.3 KNN Classifier
Based on the above results, it can be inferred that the K-Nearest Neighbour classifier with
k = 5 has a good accuracy of 83.5%.

8
Conclusion and Future Work
A larger training set is the key to improving the performance of the model. Additional
features also need to be considered, such as geographic location, age of viewers and
voters, and current trends. News analysis, movie plot analysis and social network data
analysis could be carried out, and the information thus obtained could be added to the
training set. Google Trends results could also be used to improve the results.

A
Glossary
Defines terms, acronyms and abbreviations used in the FRD

Annex A
Define terms, acronyms, and abbreviations used in the FRD

Annex B
Define terms, acronyms, and abbreviations used in the FRD

Bibliography
[1] Darin Im, Minh Thao, Dang Nguyen, "Predicting Movie Success in the U.S. Market," Dept. Elect. Eng., Stanford Univ., California, December 2011.
[2] Haiyi Zhang, Di Li, "Naïve Bayes Text Classifier," Jodrey School of Computer Science, Acadia University, Canada, 2007.
