Kaggle: TMDB Box Office Revenue Prediction

印度阿三17 2020-02-25

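Introduction

Going to the movies is one of the most common ways people relax and pass the time. Some films earn enormous box office revenue while others earn very little, so what determines how much a film makes? This article shows how to predict a film's box office revenue from information available before its release.

Key points

Data preprocessing
Building a prediction model

Introduction to box office prediction

The film industry was estimated to be worth 41.7 billion US dollars in 2018, and movies are more popular than ever. Which factors might a film's box office revenue depend on? A few come to mind:

Director
Cast
Budget
Trailer

Do these factors really determine a film's final box office revenue? We can analyze the data provided by Kaggle to answer this question. Details are on the official Kaggle page; the data consists mainly of metadata for more than 7,000 past films from The Movie Database (TMDB). The fields include cast, crew, plot keywords, budget, poster, release date, language, production companies, and production countries.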

The goal here is to predict a film's final box office revenue from this information.

Loading the data

We first load the data and look at the first 5 rows.

Link: https://pan.baidu.com/s/1AsZBZNVEHZkoeGm8v2wtGg

import pandas as pd

df = pd.read_csv("TMDB.csv")
df.head()

The data has 23 columns, and the last column, revenue, is the box office revenue. Now check the shape of the data.

df.shape

The data contains 3,000 rows in total. Next, look at its basic information.

df.info()

df.describe()

Summarize the string-typed features in the dataset.

df.describe(include=['O'])

First, see which films make up the top 10 by box office revenue.

df.sort_values(by='revenue', ascending=False).head(10)[
    ['title', 'revenue', 'release_date']]

The highest-grossing film is The Avengers, followed by Furious 7, with Avengers: Age of Ultron in third place.

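Data preprocessing

Data preprocessing mainly consists of cleaning the data, filling in missing values, and similar operations.

Release date

One column in the dataset, release_date, holds each film's release date, and we process it first. We split the date into year, month, day, and quarter so that later analysis is easier.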

def date_features(df):
    df['release_date'] = pd.to_datetime(df['release_date'])  # convert to timestamps
    df['release_year'] = df['release_date'].dt.year  # extract the year
    df['release_month'] = df['release_date'].dt.month  # extract the month
    df['release_day'] = df['release_date'].dt.day  # extract the day of the month
    df['release_quarter'] = df['release_date'].dt.quarter  # extract the quarter
    return df


df = date_features(df)
df['release_year'].head()

Check whether the data contains anomalies, that is, release years later than 2019. The data was collected before 2019, so no film in it can have been released after that year, and any such value is treated as an anomaly.

import numpy as np
# Show release years greater than 2019
df.loc[df['release_year'] > 2019, 'release_year'].head(10)

The output shows that there are indeed quite a few anomalies, so we correct them now.

# Subtract 100 from any year greater than 2019
df['release_year'] = np.where(
    df['release_year'] > 2019, df['release_year']-100, df['release_year'])
df.loc[df['release_year'] > 2019, 'release_year'].head(10)

After this correction, no anomalous years remain.
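Years in the future usually come from two-digit years being expanded into the wrong century. The sketch below illustrates this with the standard library, assuming the raw release_date strings look like 2/20/15 (month/day/two-digit year); that format is an assumption about the file, and parse_release_date is a hypothetical helper, not part of the original notebook.

```python
from datetime import datetime


def parse_release_date(text, latest_year=2019):
    """Parse an 'm/d/yy' string and pull future years back by a century."""
    parsed = datetime.strptime(text, "%m/%d/%y")
    # %y expands 00-68 to 2000-2068, so a film from 1965 is first read as 2065.
    if parsed.year > latest_year:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


print(parse_release_date("2/20/15").year)  # 2015
print(parse_release_date("7/4/65").year)   # 1965
```

The same rule is what the np.where correction applies to the whole release_year column at once.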

Now check whether the date columns contain missing values.

cols = ['release_year', 'release_month',
        'release_day']
df[cols].isnull().sum()

The output shows that the release date columns have no missing values.

Now look at the mean box office revenue for each month.

from matplotlib import pyplot as plt
%matplotlib inline

fig = plt.figure(figsize=(14, 4))

# Select revenue before aggregating so that non-numeric columns are never averaged
df.groupby('release_month')['revenue'].mean().plot(kind='bar', rot=0)
plt.ylabel('Revenue (100 million dollars)')

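The chart shows that mean box office revenue is highest for films released in June and December. A likely reason is that both months fall in holiday periods, when many students have more time to go to the cinema, so films released then earn somewhat more.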
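A mean per month says nothing about how many films were released in that month, so it can help to look at both numbers side by side. The following is a minimal sketch on a made-up toy table (the values are invented for illustration only):

```python
import pandas as pd

# Hypothetical toy data: month of release and revenue for six films.
toy = pd.DataFrame({
    "release_month": [6, 6, 12, 1, 1, 1],
    "revenue": [300.0, 100.0, 250.0, 10.0, 20.0, 30.0],
})

# One row per month, with the mean revenue and the number of films.
monthly = toy.groupby("release_month")["revenue"].agg(["mean", "count"])
print(monthly)
```

On the real data the same call would be made on df instead of toy; a month with a high mean but a small count deserves a more cautious reading.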

Next, look at the mean box office revenue for each year.

release_year_mean_data = df.groupby(['release_year'])['revenue'].mean()
fig = plt.figure(figsize=(14, 5))  # set the figure size
plt.plot(release_year_mean_data)
plt.ylabel('Mean revenue value')  # set the y-axis label
plt.title('Mean revenue Over Years')  # set the title

The chart shows that mean revenue per year has generally risen over time. This may be related to economic growth: as people become wealthier, they spend a larger share of their income on entertainment.

Next, look at how film runtime relates to release year.

release_year_mean_data = df.groupby(['release_year'])['runtime'].mean()
fig = plt.figure(figsize=(14, 5))  # set the figure size
plt.plot(release_year_mean_data)
plt.ylabel('Mean runtime value')  # set the y-axis label
plt.title('Mean runtime Over Years')  # set the title

The chart shows that before 1980 the mean runtime fluctuated considerably, while after 1980 it settled at a little over 100 minutes.

Collections

Now look at the belongs_to_collection column. We first print its first 5 values to inspect them.

for i, e in enumerate(df['belongs_to_collection'][:5]):
    print(i, e)
    print(type(e))

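The output shows that this column mainly holds the collection name, poster, and similar information. Many values are nan, that is, missing. Now count how many values are missing. Note that we decide whether a value is present or missing by checking whether the cell is a string.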

df['belongs_to_collection'].apply(
    lambda x: 1 if type(x) == str else 0).value_counts()

Of the 3,000 rows, 2,396 have no value in this column. We extract the name attribute from the column and create another column that records whether a collection is present.

df['collection_name'] = df['belongs_to_collection'].apply(
    lambda x: eval(x)[0]['name'] if type(x) == str else 0)
df['has_collection'] = df['belongs_to_collection'].apply(
    lambda x: 1 if type(x) == str else 0)
df[['collection_name', 'has_collection']].head()
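The cells in this column are Python-literal strings (a list of dicts), which is why the code above calls eval on them. As a hedged alternative, ast.literal_eval from the standard library parses the same literals without executing arbitrary code. The helper name parse_records and the sample cell below are hypothetical, used only for illustration.

```python
import ast


def parse_records(cell):
    """Return the list of dicts stored in a cell, or [] for a missing value."""
    if not isinstance(cell, str):
        return []  # missing cells arrive as float NaN, not as strings
    return ast.literal_eval(cell)


# Hypothetical cell value shaped like the belongs_to_collection column.
cell = "[{'id': 10, 'name': 'Toy Collection', 'poster_path': None}]"
print(parse_records(cell)[0]["name"])
print(parse_records(float("nan")))
```

With pandas this could be applied as df['belongs_to_collection'].apply(parse_records), and the same helper would work for the other list-of-dict columns handled below.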

Genres

We process the genres column in the same way.

for i, e in enumerate(df['genres'][:5]):
    print(i, e)

The genres column stores each film's genres, such as comedy or drama. We can count how many films fall into each genre; first, collect the genres that each film belongs to.

list_of_genres = list(df['genres'].apply(lambda x: [i['name']
                                                    for i in eval(x)] if type(x) == str else []).values)
list_of_genres[:5]

Count how many times each genre appears.

from collections import Counter

most_common_genres = Counter(
    [i for j in list_of_genres for i in j]).most_common()
most_common_genres

Plot the result.

fig = plt.figure(figsize=(10, 6))
data = dict(most_common_genres)
names = list(data.keys())
values = list(data.values())

plt.barh(sorted(range(len(data)), reverse=True),
         values, tick_label=names, color='teal')
plt.xlabel('Count')
plt.title('Movie Genre Count')
plt.show()

The chart shows that the most common genre is Drama, followed by Comedy. We can also draw a word cloud to show this more visually. First install the wordcloud library.

!pip install wordcloud

Draw the word cloud.

from wordcloud import WordCloud

plt.figure(figsize=(12, 8))
text = ' '.join([i for j in list_of_genres for i in j])
# Configure the word cloud
wordcloud = WordCloud(max_font_size=None, background_color='white', collocations=False,
                      width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top genres')
plt.axis("off")
plt.show()

In the word cloud above, the larger a word is drawn, the more often it occurs.

From this column we can extract the number of genres a film has, as well as the full set of genres it belongs to.

df['num_genres'] = df['genres'].apply(
    lambda x: len(eval(x)) if type(x) == str else 0)
df['all_genres'] = df['genres'].apply(lambda x: ' '.join(
    sorted([i['name'] for i in eval(x)])) if type(x) == str else '')
top_genres = [m[0] for m in Counter(
    [i for j in list_of_genres for i in j]).most_common(15)]
for g in top_genres:
    df['genre_' + g] = df['all_genres'].apply(lambda x: 1 if g in x else 0)
cols = [i for i in df.columns if 'genre_' in str(i)]
df[cols].head()

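In the result above, columns such as genre_Drama and genre_Comedy are the extracted feature columns: a film that belongs to a genre has 1 in that genre's column and 0 otherwise. This is similar to the familiar one-hot encoding, except that a film may have several columns set to 1.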
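The indicator columns above test whether a label occurs as a substring of the joined label string, which can also match a longer label that merely contains the shorter one. A minimal sketch of the same top-k encoding using exact set membership is shown below; top_k_flags and the toy film list are hypothetical names and data, not part of the original notebook.

```python
from collections import Counter


def top_k_flags(label_lists, k):
    """Return the k most common labels and one 0/1 dict per row."""
    counts = Counter(label for labels in label_lists for label in labels)
    top = [label for label, _ in counts.most_common(k)]
    rows = []
    for labels in label_lists:
        present = set(labels)  # exact match, not substring match
        rows.append({label: int(label in present) for label in top})
    return top, rows


# Toy input: the genres of four films (the last film has none).
films = [["Drama", "Comedy"], ["Drama"], ["Action", "Drama", "Comedy"], []]
top, rows = top_k_flags(films, k=2)
print(top)
print(rows)
```

The list of dicts can be turned into columns with pd.DataFrame(rows) if a DataFrame is needed.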

Earlier we counted the number of films in each genre. Now we look at how genre relates to box office revenue and release year. We use the plotly library for these plots, so install it first.

!pip install plotly

Import the required libraries.

import plotly.graph_objs as go
import plotly.offline as py
py.init_notebook_mode(connected=False)

Plot the relationship between the three.

drama = df.loc[df['genre_Drama'] == 1, ]  # all films tagged with the Drama genre
comedy = df.loc[df['genre_Comedy'] == 1, ]
action = df.loc[df['genre_Action'] == 1, ]
thriller = df.loc[df['genre_Thriller'] == 1, ]

# Mean revenue per release year for each genre
drama_revenue = drama.groupby('release_year')['revenue'].mean()
comedy_revenue = comedy.groupby('release_year')['revenue'].mean()
action_revenue = action.groupby('release_year')['revenue'].mean()
thriller_revenue = thriller.groupby('release_year')['revenue'].mean()

revenue_concat = pd.concat([drama_revenue,    # combine into one table, aligned on year
                            comedy_revenue,
                            action_revenue,
                            thriller_revenue],
                           axis=1)

revenue_concat.columns = ['drama', 'comedy', 'action', 'thriller']
revenue_concat = revenue_concat.sort_index()  # keep the years in ascending order

data = [go.Scatter(x=revenue_concat.index, y=revenue_concat.drama, name='drama'),
        go.Scatter(x=revenue_concat.index,
                   y=revenue_concat.comedy, name='comedy'),
        go.Scatter(x=revenue_concat.index,
                   y=revenue_concat.action, name='action'),
        go.Scatter(x=revenue_concat.index, y=revenue_concat.thriller, name='thriller')]
# Draw the figure
layout = go.Layout(dict(title='Mean Revenue by Top 4 Movie Genres Over Years',
                        xaxis=dict(title='Year'),
                        yaxis=dict(title='Revenue'),
                        ), legend=dict(
    orientation="v"))

py.iplot(dict(data=data, layout=layout))

The chart shows that since 2000 the revenue of Action films has been rising steadily, which suggests that action films are becoming increasingly popular with audiences.
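When per-genre yearly means are combined with pd.concat(..., axis=1), the rows are aligned on the year index, and a year that is missing for one genre becomes NaN in that genre's column. A small sketch with invented numbers illustrates this:

```python
import pandas as pd

# Toy yearly means for two genres that do not cover the same years.
drama = pd.Series({1999: 10.0, 2000: 20.0}, name="drama")
action = pd.Series({2000: 50.0, 2001: 70.0}, name="action")

# Rows are matched by year; sort_index keeps the years in ascending order.
combined = pd.concat([drama, action], axis=1).sort_index()
print(combined)
```

Because alignment is by label, the index of the combined table is already the release year and does not need to be assigned separately.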

Production companies

In the same way, now look at the production companies (production_companies).

for i, e in enumerate(df['production_companies'][:5]):
    print(i, e)

The output shows that a single film may come from several production companies. Now plot the number of films released by each company.

list_of_companies = list(df['production_companies'].apply(
    lambda x: [i['name'] for i in eval(x)] if type(x) == str else []).values)
# Count the number of films released by each company
most_common_companies = Counter(
    [i for j in list_of_companies for i in j]).most_common(20)
fig = plt.figure(figsize=(10, 6))
data = dict(most_common_companies)
names = list(data.keys())
values = list(data.values())

plt.barh(sorted(range(len(data)), reverse=True),
         values, tick_label=names, color='brown')
plt.xlabel('Count')
plt.title('Top 20 Production Company Count')
plt.show()

The chart shows that Warner Bros. made the most films. This is the well-known Warner Bros. Entertainment.

As before, we now extract some important information from this column, in much the same way as for genres.

df['num_companies'] = df['production_companies'].apply(
    lambda x: len(eval(x)) if type(x) == str else 0)  # count companies, not characters
df['all_production_companies'] = df['production_companies'].apply(
    lambda x: ' '.join(sorted([i['name'] for i in eval(x)])) if type(x) == str else '')
top_companies = [m[0] for m in Counter(
    [i for j in list_of_companies for i in j]).most_common(30)]
for g in top_companies:
    df['production_company_' + g] = df['all_production_companies'].apply(
        lambda x: 1 if g in x else 0)

cols = [i for i in df.columns if 'production_company' in str(i)]
df[cols].head()

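In the result above, columns such as production_company_Warner Bros. and production_company_Universal Pictures are the extracted features: a film produced by a given company has 1 in that company's column and 0 otherwise.

With these features extracted, we now plot the mean box office revenue of films made by several of these companies.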

Warner_Bros = df.loc[df['production_company_Warner Bros.'] == 1, ]
Universal_Pictures = df.loc[df['production_company_Universal Pictures'] == 1, ]
Twentieth_Century_Fox_Film = df.loc[
    df['production_company_Twentieth Century Fox Film Corporation'] == 1, ]
Columbia_Pictures = df.loc[df['production_company_Columbia Pictures'] == 1, ]

# Mean revenue per release year for each company
Warner_Bros_revenue = Warner_Bros.groupby('release_year')['revenue'].mean()
Universal_Pictures_revenue = Universal_Pictures.groupby(
    'release_year')['revenue'].mean()
Twentieth_Century_Fox_Film_revenue = Twentieth_Century_Fox_Film.groupby(
    'release_year')['revenue'].mean()
Columbia_Pictures_revenue = Columbia_Pictures.groupby(
    'release_year')['revenue'].mean()

prod_revenue_concat = pd.concat([Warner_Bros_revenue,
                                 Universal_Pictures_revenue,
                                 Twentieth_Century_Fox_Film_revenue,
                                 Columbia_Pictures_revenue], axis=1)
prod_revenue_concat.columns = ['Warner_Bros',
                               'Universal_Pictures',
                               'Twentieth_Century_Fox_Film',
                               'Columbia_Pictures']

fig = plt.figure(figsize=(13, 5))
prod_revenue_concat.agg("mean", axis='rows').sort_values(ascending=True).plot(
    kind='barh',
    title='Mean Revenue (100 million dollars) of Most Common Production Companies')
plt.xlabel('Revenue (100 million dollars)')

Now analyze how production company, year, and box office revenue relate.

data = [go.Scatter(x=prod_revenue_concat.index, y=prod_revenue_concat.Warner_Bros, name='Warner_Bros'),
        go.Scatter(x=prod_revenue_concat.index,
                   y=prod_revenue_concat.Universal_Pictures, name='Universal_Pictures'),
        go.Scatter(x=prod_revenue_concat.index,
                   y=prod_revenue_concat.Twentieth_Century_Fox_Film, name='Twentieth_Century_Fox_Film'),
        go.Scatter(x=prod_revenue_concat.index, y=prod_revenue_concat.Columbia_Pictures, name='Columbia_Pictures'), ]

layout = go.Layout(dict(title='Mean Revenue of Movie Production Companies over Years',
                        xaxis=dict(title='Year'),
                        yaxis=dict(title='Revenue'),
                        ), legend=dict(
    orientation="v"))
py.iplot(dict(data=data, layout=layout))

Production countries

The previous section looked at production companies. Now we look at each film's production countries, that is, which country or countries made the film.

for i, e in enumerate(df['production_countries'][:5]):
    print(i, e)

In production_countries, name is the country's full name and iso_3166_1 is its short code. Now let's see which countries produced the most films.

list_of_countries = list(df['production_countries'].apply(
    lambda x: [i['name'] for i in eval(x)] if type(x) == str else []).values)
most_common_countries = Counter(
    [i for j in list_of_countries for i in j]).most_common(20)

fig = plt.figure(figsize=(10, 6))
data = dict(most_common_countries)
names = list(data.keys())
values = list(data.values())

plt.barh(sorted(range(len(data)), reverse=True),
         values, tick_label=names, color='purple')
plt.xlabel('Count')
plt.title('Country Count')
plt.show()

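The chart shows that the United States produced the most films, followed by the United Kingdom and then France. The number of films from Hong Kong is almost level with that from mainland China, which is somewhat surprising.

In the same way, we now extract features from the production countries.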

df['num_countries'] = df['production_countries'].apply(
    lambda x: len(eval(x)) if type(x) == str else 0)
df['all_countries'] = df['production_countries'].apply(lambda x: ' '.join(
    sorted([i['name'] for i in eval(x)])) if type(x) == str else '')
top_countries = [m[0] for m in Counter(
    [i for j in list_of_countries for i in j]).most_common(25)]
for g in top_countries:
    df['production_country_' + g] = df['all_countries'].apply(
        lambda x: 1 if g in x else 0)

cols = [i for i in df.columns if 'production_country' in str(i)]
df[cols].head()

In the extracted feature columns, a film made in a given country has 1 in that country's column and 0 otherwise.

Spoken languages

Different countries may use different languages, so films differ in language too. Now look at the spoken languages column (spoken_languages).

for i, e in enumerate(df['spoken_languages'][:5]):
    print(i, e)

In this column, name is the language and iso_639_1 is its short code. A film may also have several languages. Now count the languages to see which one appears in the most films.

list_of_languages = list(df['spoken_languages'].apply(
    lambda x: [i['name'] for i in eval(x)] if type(x) == str else []).values)

most_common_languages = Counter(
    [i for j in list_of_languages for i in j]).most_common(20)

fig = plt.figure(figsize=(10, 6))
data = dict(most_common_languages)
names = list(data.keys())
values = list(data.values())

plt.barh(sorted(range(len(data)), reverse=True), values, tick_label=names)
plt.xlabel('Count')
plt.title('Language Count')
plt.show()

As you may have guessed, English is by far the most common, and the chart confirms it. We extract language features in the same way.

df['num_languages'] = df['spoken_languages'].apply(
    lambda x: len(eval(x)) if type(x) == str else 0)
# Join language names (not ISO codes) so that they can be matched against top_languages
df['all_languages'] = df['spoken_languages'].apply(lambda x: ' '.join(
    sorted([i['name'] for i in eval(x)])) if type(x) == str else '')
top_languages = [m[0] for m in Counter(
    [i for j in list_of_languages for i in j]).most_common(30)]
for g in top_languages:
    df['language_' + g] = df['all_languages'].apply(lambda x: 1 if g in x else 0)
cols = [i for i in df.columns if 'language_' in str(i)]
df[cols].head()

Keywords

The dataset has a keywords column (Keywords). We process it in the same way.

for i, e in enumerate(df['Keywords'][:5]):
    print(i, e)

Keywords describe what a film is about. For a crime film, for example, the keywords might include police or drug lord. Now count the keywords.

list_of_keywords = list(df['Keywords'].apply(
    lambda x: [i['name'] for i in eval(x)] if type(x) == str else []).values)

most_common_keywords = Counter(
    [i for j in list_of_keywords for i in j]).most_common(20)

fig = plt.figure(figsize=(10, 6))
data = dict(most_common_keywords)
names = list(data.keys())
values = list(data.values())

plt.barh(sorted(range(len(data)), reverse=True),
         values, tick_label=names, color='purple')
plt.xlabel('Count')
plt.title('Top 20 Most Common Keyword Count')
plt.show()

The output shows that woman director is the most frequent keyword. Now we can look at the keywords associated with a few genres.

text_drama = " ".join(review for review in drama['Keywords'].apply(
    lambda x: ' '.join(sorted([i['name'] for i in eval(x)])) if type(x) == str else ''))
text_comedy = " ".join(review for review in comedy['Keywords'].apply(
    lambda x: ' '.join(sorted([i['name'] for i in eval(x)])) if type(x) == str else ''))
text_action = " ".join(review for review in action['Keywords'].apply(
    lambda x: ' '.join(sorted([i['name'] for i in eval(x)])) if type(x) == str else ''))
text_thriller = " ".join(review for review in thriller['Keywords'].apply(
    lambda x: ' '.join(sorted([i['name'] for i in eval(x)])) if type(x) == str else ''))

wordcloud1 = WordCloud(background_color="white",
                       colormap="Reds").generate(text_drama)
wordcloud2 = WordCloud(background_color="white",
                       colormap="Blues").generate(text_comedy)
wordcloud3 = WordCloud(background_color="white",
                       colormap="Greens").generate(text_action)
wordcloud4 = WordCloud(background_color="white",
                       colormap="Greys").generate(text_thriller)


fig = plt.figure(figsize=(25, 20))

plt.subplot(221)
plt.imshow(wordcloud1, interpolation='bilinear')
plt.title('Drama Keywords')
plt.axis("off")

plt.subplot(222)
plt.imshow(wordcloud2, interpolation='bilinear')
plt.title('Comedy Keywords')
plt.axis("off")
plt.show()

fig = plt.figure(figsize=(25, 20))

plt.subplot(223)
plt.imshow(wordcloud3, interpolation='bilinear')
plt.title('Action Keywords')
plt.axis("off")

plt.subplot(224)
plt.imshow(wordcloud4, interpolation='bilinear')
plt.title('Thriller Keywords')
plt.axis("off")
plt.show()

The word clouds show that keywords for Drama and Comedy films mostly include family, woman, and based novel, while for Action and Thriller films keywords such as police and death appear most often.

We extract features from this column in the same way.

df['num_Keywords'] = df['Keywords'].apply(
    lambda x: len(eval(x)) if type(x) == str else 0)
df['all_Keywords'] = df['Keywords'].apply(lambda x: ' '.join(
    sorted([i['name'] for i in eval(x)])) if type(x) == str else '')
top_keywords = [m[0] for m in Counter(
    [i for j in list_of_keywords for i in j]).most_common(30)]
for g in top_keywords:
    df['keyword_' + g] = df['all_Keywords'].apply(lambda x: 1 if g in x else 0)
cols = [i for i in df.columns if 'keyword_' in str(i)]
df[cols].head()

Cast

The cast also plays a part in how well a film does, so now we look at the cast column.

for i, e in enumerate(df['cast'][:1]):
    print(i, e)

The output shows that the cast information includes gender, name, and so on. Now count which actors appear in the most films.

list_of_cast_names = list(df['cast'].apply(
    lambda x: [i['name'] for i in eval(x)] if type(x) == str else []).values)
most_common_cast = Counter(
    [i for j in list_of_cast_names for i in j]).most_common(20)

fig = plt.figure(figsize=(10, 6))
data = dict(most_common_cast)
names = list(data.keys())
values = list(data.values())

plt.barh(sorted(range(len(data)), reverse=True),
         values, tick_label=names, color='purple')
plt.xlabel('Count')
plt.title('Top 20 Most Common Cast Count')
plt.show()

The output shows that Samuel L. Jackson appears in the most films. Foreign names can be hard to remember, but anyone who has watched Hollywood blockbusters will recognize most of the actors in this list. Now extract the features.

df['num_cast'] = df['cast'].apply(
    lambda x: len(eval(x)) if type(x) == str else 0)
df['all_cast'] = df['cast'].apply(lambda x: ' '.join(
    sorted([i['name'] for i in eval(x)])) if type(x) == str else '')
top_cast_names = [m[0] for m in Counter(
    [i for j in list_of_cast_names for i in j]).most_common(30)]
for g in top_cast_names:
    df['cast_name_' + g] = df['all_cast'].apply(lambda x: 1 if g in x else 0)
cols = [i for i in df.columns if 'cast_name' in str(i)]
df[cols].head()

Plot the mean box office revenue of the films featuring the most frequently appearing actors.

cast_name_Samuel_L_Jackson = df.loc[df['cast_name_Samuel L. Jackson'] == 1, ]
cast_name_Robert_De_Niro = df.loc[df['cast_name_Robert De Niro'] == 1, ]
cast_name_Morgan_Freeman = df.loc[df['cast_name_Morgan Freeman'] == 1, ]
cast_name_J_K_Simmons = df.loc[df['cast_name_J.K. Simmons'] == 1, ]


# Mean revenue of the films each actor appears in
cast_name_Samuel_L_Jackson_revenue = cast_name_Samuel_L_Jackson['revenue'].mean()
cast_name_Robert_De_Niro_revenue = cast_name_Robert_De_Niro['revenue'].mean()
cast_name_Morgan_Freeman_revenue = cast_name_Morgan_Freeman['revenue'].mean()
cast_name_J_K_Simmons_revenue = cast_name_J_K_Simmons['revenue'].mean()


cast_revenue_concat = pd.Series([cast_name_Samuel_L_Jackson_revenue,
                                 cast_name_Robert_De_Niro_revenue,
                                 cast_name_Morgan_Freeman_revenue,
                                 cast_name_J_K_Simmons_revenue])

cast_revenue_concat.index = ['Samuel L. Jackson',
                             'Robert De Niro',
                             'Morgan Freeman',
                             'J.K. Simmons', ]

fig = plt.figure(figsize=(13, 5))
cast_revenue_concat.sort_values(ascending=True).plot(
    kind='barh', title='Mean Revenue (100 million dollars) by Top 4 Most Common Cast')
plt.xlabel('Revenue (100 million dollars)')

Now extract features such as the gender of cast members.

list_of_cast_genders = list(df['cast'].apply(
    lambda x: [i['gender'] for i in eval(x)] if type(x) == str else []).values)
list_of_cast_characters = list(df['cast'].apply(
    lambda x: [i['character'] for i in eval(x)] if type(x) == str else []).values)

# Count the cast members with each gender code, film by film
df['genders_0'] = [sum(1 for g in genders if g == 0) for genders in list_of_cast_genders]
df['genders_1'] = [sum(1 for g in genders if g == 1) for genders in list_of_cast_genders]
df['genders_2'] = [sum(1 for g in genders if g == 2) for genders in list_of_cast_genders]
top_cast_characters = [m[0] for m in Counter(
    [i for j in list_of_cast_characters for i in j]).most_common(15)]
for g in top_cast_characters:
    df['cast_character_' + g] = df['cast'].apply(
        lambda x: 1 if type(x) == str and g in x else 0)
cols = [i for i in df.columns if 'cast_cha' in str(i)]
df[cols].head()
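The gender features are meant to be per-film counts of cast members with each gender code. A minimal standard-library sketch of that counting step follows; gender_counts and the sample cell are hypothetical, and the sketch makes no assumption about what the codes 0, 1, and 2 stand for.

```python
import ast
from collections import Counter


def gender_counts(cast_cell):
    """Count cast members per gender code for one film's cast cell."""
    if not isinstance(cast_cell, str):
        return Counter()  # missing cast information
    return Counter(member["gender"] for member in ast.literal_eval(cast_cell))


# Hypothetical cell value shaped like the cast column.
cell = ("[{'name': 'A', 'gender': 2}, {'name': 'B', 'gender': 1}, "
        "{'name': 'C', 'gender': 2}]")
counts = gender_counts(cell)
print(counts[0], counts[1], counts[2])  # 0 1 2
```

Applied to the DataFrame, a column could be built with df['cast'].apply(lambda c: gender_counts(c)[1]), and likewise for the other codes.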

Crew

How good a film is also depends heavily on its crew, so now we look at the crew column.

for i, e in enumerate(df['crew'][:1]):
    print(i, e)

The output shows that the crew includes the director, assistant directors, the composer, and so on. Now count how many films each crew member worked on.

list_of_crew_names = list(df['crew'].apply(
    lambda x: [i['name'] for i in eval(x)] if type(x) == str else []).values)
most_common_crew = Counter(
    [i for j in list_of_crew_names for i in j]).most_common(20)

fig = plt.figure(figsize=(10, 6))
data = dict(most_common_crew)
names = list(data.keys())
values = list(data.values())

plt.barh(sorted(range(len(data)), reverse=True),
         values, tick_label=names, color='purple')
plt.xlabel('Count')
plt.title('Top 20 Most Common Crew Count')
plt.show()

The chart shows that crew members such as Avy Kaufman and Robert Rodriguez worked on the most films. Now extract the features.

df['num_crew'] = df['crew'].apply(
    lambda x: len(eval(x)) if type(x) == str else 0)
df['all_crew'] = df['crew'].apply(lambda x: ' '.join(
    sorted([i['name'] for i in eval(x)])) if type(x) == str else '')
top_crew_names = [m[0] for m in Counter(
    [i for j in list_of_crew_names for i in j]).most_common(30)]
for g in top_crew_names:
    df['crew_name_' + g] = df['all_crew'].apply(
        lambda x: 1 if type(x) == str and g in x else 0)
cols = [i for i in df.columns if 'crew_name' in str(i)]
df[cols].head()

We analyze four of the most frequent crew members in the same way.

crew_name_Avy_Kaufman = df.loc[df['crew_name_Avy Kaufman'] == 1, ]
crew_name_Robert_Rodriguez = df.loc[df['crew_name_Robert Rodriguez'] == 1, ]
crew_name_Deborah_Aquila = df.loc[df['crew_name_Deborah Aquila'] == 1, ]
crew_name_James_Newton_Howard = df.loc[df['crew_name_James Newton Howard'] == 1, ]

# Mean revenue of the films each crew member worked on
crew_name_Avy_Kaufman_revenue = crew_name_Avy_Kaufman['revenue'].mean()
crew_name_Robert_Rodriguez_revenue = crew_name_Robert_Rodriguez['revenue'].mean()
crew_name_Deborah_Aquila_revenue = crew_name_Deborah_Aquila['revenue'].mean()
crew_name_James_Newton_Howard_revenue = crew_name_James_Newton_Howard[
    'revenue'].mean()


crew_revenue_concat = pd.Series([crew_name_Avy_Kaufman_revenue,
                                 crew_name_Robert_Rodriguez_revenue,
                                 crew_name_Deborah_Aquila_revenue,
                                 crew_name_James_Newton_Howard_revenue])
crew_revenue_concat.index = ['Avy Kaufman',
                             'Robert Rodriguez',
                             'Deborah Aquila',
                             'James Newton Howard']


fig = plt.figure(figsize=(13, 5))
crew_revenue_concat.sort_values(ascending=True).plot(
    kind='barh', title='Mean Revenue (100 million dollars) by Top 4 Most Common Crew')
plt.xlabel('Revenue (100 million dollars)')

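The chart shows that, of these four, the crew member with the highest mean box office revenue is James Newton Howard, a composer who is mainly responsible for film scores.

Feature engineering

Because the revenue distribution is heavily skewed, we apply a log transform to it.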

fig = plt.figure(figsize=(15, 10))

plt.subplot(221)
df['revenue'].plot(kind='hist', bins=100)
plt.title('Distribution of Revenue')
plt.xlabel('Revenue')

plt.subplot(222)
np.log1p(df['revenue']).plot(kind='hist', bins=100)
plt.title('Train Log Revenue Distribution')
plt.xlabel('Log Revenue')
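The transform used here is log1p, that is log(1 + x), which stays defined when a value is 0, and its inverse is expm1. The short sketch below, on made-up revenue figures, shows the round trip with the standard library; it is an illustration of the transform only, not part of the original notebook.

```python
import math

# Made-up revenue values spanning several orders of magnitude, including 0.
revenues = [0, 1_000, 1_000_000, 1_500_000_000]

logged = [math.log1p(r) for r in revenues]    # compress the long right tail
restored = [math.expm1(v) for v in logged]    # map back to the original scale

for r, v in zip(revenues, logged):
    print(r, round(v, 2))
```

A model trained on log1p(revenue) predicts on the log scale, so np.expm1 would be needed to turn its predictions back into revenue figures.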

Plot the same transform for the budget column.

fig = plt.figure(figsize=(15, 10))

plt.subplot(221)
df['budget'].plot(kind='hist', bins=100)
plt.title('Train Budget Distribution')
plt.xlabel('Budget')

plt.subplot(222)
np.log1p(df['budget']).plot(kind='hist', bins=100)
plt.title('Train Log Budget Distribution')
plt.xlabel('Log Budget')
plt.show()

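So far we have extracted features for release date, cast, crew, and so on. The dataset also contains columns such as the film title and film ID, which probably have little influence on the prediction, so we now drop them and keep only the extracted feature columns.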

drop_columns = ['homepage', 'imdb_id', 'poster_path', 'status',
                'title', 'release_date', 'tagline', 'overview',
                'original_title', 'all_genres', 'all_cast',
                'original_language', 'collection_name', 'all_crew',
                'belongs_to_collection', 'genres', 'production_companies',
                'all_production_companies', 'production_countries',
                'all_countries', 'spoken_languages', 'all_languages',
                'Keywords', 'all_Keywords', 'cast', 'crew']

df_drop = df.drop(drop_columns, axis=1).dropna(axis=1, how='any')
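Note that dropna(axis=1, how='any') removes every column that contains at least one missing value, rather than removing rows. The toy example below contrasts that with filling the gap instead; the column values are invented, and whether any numeric column in the real data has missing values is not stated above.

```python
import pandas as pd

# Toy table: runtime has one missing value, budget has none.
toy = pd.DataFrame({
    "budget": [10.0, 20.0, 30.0],
    "runtime": [90.0, None, 120.0],
})

dropped = toy.dropna(axis=1, how="any")   # the whole runtime column is removed
filled = toy.fillna({"runtime": toy["runtime"].median()})  # keeps the column

print(list(dropped.columns))
print(filled["runtime"].tolist())
```

Filling with the median is only one possible choice; it is shown here as an assumption, not as the method used in this article.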

Look at the final data.

df_drop.head()

Split the data into a training set and a test set.

from sklearn.model_selection import train_test_split

data_X = df_drop.drop(['id', 'revenue'], axis=1)
data_y = np.log1p(df_drop['revenue'])
train_X, test_X, train_y, test_y = train_test_split(
    data_X, data_y.values, test_size=0.2)

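Build the prediction model, then train it and make predictions. Here we use Lasso, a regularized variant of linear regression.

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
model = Lasso()
model.fit(train_X, train_y)  # train the model
y_pred = model.predict(test_X)  # predict on the test set
mean_squared_error(test_y, y_pred)  # evaluate: mean squared error on log revenue

The mean squared error between the Lasso predictions and the true values is around 6 to 7; the exact value varies with the random split. In the same way, we rebuild the model using ridge regression (Ridge).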

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
model = Ridge()
model.fit(train_X, train_y)
y_pred = model.predict(test_X)
mean_squared_error(y_pred, test_y)

The result shows that Ridge regression performs slightly better than Lasso regression.
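Because the target is log1p(revenue), the mean squared errors reported above are on the log scale, and their square root is the root mean squared logarithmic error (RMSLE). The sketch below computes RMSLE directly from revenue values and works out what an MSE of 6 to 7 corresponds to; the rmsle helper is hypothetical and written with the standard library only.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared error between log1p(actual) and log1p(predicted)."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# An MSE of 6 to 7 on log revenue corresponds to this RMSLE range.
print(round(math.sqrt(6), 2), round(math.sqrt(7), 2))  # 2.45 2.65
```

An error of about 2.45 on the natural-log scale means a typical prediction is off by a factor of roughly e^2.45, a little under 12, so both models still leave a great deal of room for improvement.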

Summary

This article predicted box office revenue, which is a typical regression task. During preprocessing we extracted features by hand and visualized them. We then examined log transforms of the revenue and budget columns to reduce their skew, and trained on log-transformed revenue. For the prediction models we used the common Lasso and Ridge models.

Source: https://www./content-4-640301.html
