Abalone Dataset Analysis and Visualization, and Predicting Abalone Age with Linear Regression (TensorFlow)
Part 1: Dataset description
    Name            Data Type   Meas.   Description
    ----            ---------   -----   -----------
    Sex             nominal             M, F, and I (infant)
    Length          continuous  mm      Longest shell measurement
    Diameter        continuous  mm      perpendicular to length
    Height          continuous  mm      with meat in shell
    Whole weight    continuous  grams   whole abalone
    Shucked weight  continuous  grams   weight of meat
    Viscera weight  continuous  grams   gut weight (after bleeding)
    Shell weight    continuous  grams   after being dried
    Rings           integer             +1.5 gives the age in years
There are 9 attributes in total. The last one (Rings) is the number of growth rings on the abalone; as with tree rings, an abalone grows roughly one ring per year.
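The Rings column is turned into an age by the rule in the table (Rings + 1.5). A minimal sketch of that conversion, using hypothetical ring counts rather than values read from the dataset:

```python
import pandas as pd

# Hypothetical ring counts, for illustration only (not read from the dataset).
rings = pd.Series([15, 7, 9], name="Rings")

# Per the dataset description: age in years = Rings + 1.5
age = rings + 1.5
print(age.tolist())
```

Because the offset is a constant, predicting Rings and predicting age are equivalent regression problems.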
Data analysis:
1. Import the required third-party libraries
I worked in IPython, so I added the magic command %matplotlib inline to have plots displayed inline.
    %matplotlib inline
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
2. Read the data
Use pandas to read and analyze the data:
    data = pd.read_csv('dataset.data')
Use the .info() method to view a summary of the dataset:
    data.info()
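As the output shows, there are 4176 records and 9 columns, with no missing values. Rings is int64, Sex is a string (object) column, and the remaining columns are float64. Note that read_csv treats the first line of a file as the header by default, so for a file without a header row the first record is consumed as column names; the UCI abalone file has 4177 records, which suggests that is what happened here.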
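Because the original file has no column names, we add them to make later steps easier:
    data.columns = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
The first five rows after adding the column names can be viewed with data.head().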
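As an alternative to renaming the columns afterwards, read_csv can be told that the file has no header row. The sketch below uses three illustrative records in the same comma-separated layout (held in an in-memory string rather than the real file) to show the difference in row counts:

```python
import io
import pandas as pd

# Three illustrative records in the header-less, comma-separated layout.
raw = (
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
)
columns = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight',
           'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']

# Default behaviour: the first record is consumed as the header row.
df_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every record; names= assigns the column names in the same call.
df_named = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(len(df_default), len(df_named))
```

With the real file, the same two arguments would be passed to pd.read_csv('dataset.data', ...).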
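Next, look at how the records are distributed by sex.
Abalone sex has three categories (M, F, I), meaning male, female, and infant.
The number of records in each category is M: 1527, I: 1342, F: 1307.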
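On the DataFrame itself, counts like these are usually obtained with data['Sex'].value_counts(). As a small sketch, the counts reported above can be turned into percentage shares; the Series here is typed in by hand from those numbers rather than computed from the file:

```python
import pandas as pd

# Counts as reported in the text above.
counts = pd.Series({'M': 1527, 'I': 1342, 'F': 1307}, name='Sex')

# Share of each category, in percent, rounded to one decimal place.
percent = (counts / counts.sum() * 100).round(1)

print(counts.sum())
print(percent.to_dict())
```

The three classes are fairly balanced, each close to one third of the records.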
Use a pie chart to show the distribution of the sexes visually.
Count the records in each category (value_counts returns the labels and their counts in the same order):
    counts = data['Sex'].value_counts()
Get the category labels:
    labels = counts.index.tolist()
Get the number of records for each label:
    fraces = counts.tolist()
Draw the pie chart:
    explode = [0.1, 0, 0]
    # SimHei lets matplotlib render Chinese characters in plot text
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.title("Abalone sex distribution")
    wedges, texts, autotexts = plt.pie(x=fraces, labels=labels, autopct='%.1f%%', explode=explode, shadow=True)
    plt.legend(wedges, labels, fontsize=12, title="Sex", loc="center left", bbox_to_anchor=(0.91, 0, 0.3, 1))
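When the labels and the sizes for a pie chart come from two different pandas calls, their order matters: unique() lists categories in order of first appearance, while value_counts() sorts by count. A small sketch on made-up labels (not the abalone data) of how the two orders can differ, and of reading both from value_counts() so that each label stays paired with its count:

```python
import pandas as pd

# Made-up labels, for illustration only.
sex = pd.Series(['M', 'F', 'I', 'I', 'I', 'F'])

# unique(): order of first appearance.
appearance_order = list(sex.unique())

# value_counts(): sorted by count, largest first.
counts = sex.value_counts()

# Taking labels and sizes from the same object keeps them aligned.
labels = counts.index.tolist()
sizes = counts.tolist()

print(appearance_order)
print(labels, sizes)
```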
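For the other features, which are continuous, look at their probability density. Taking Length as an example, draw a kernel density estimate plot and a violin plot:
    data_length = data['Length']  # assumed definition; the original does not show how data_length was created
    sns.kdeplot(data_length)

    sns.violinplot(x=data_length)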
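To make clear what a kernel density estimate plot draws, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE. The function name gaussian_kde_1d, the fixed bandwidth, and the synthetic samples are all assumptions for illustration; seaborn chooses its bandwidth automatically and its implementation differs in detail.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    # Standardized distance from every grid point to every sample,
    # shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Synthetic values standing in for a continuous feature (not real abalone data).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.5, scale=0.1, size=200)

grid = np.linspace(0.0, 1.0, 501)
density = gaussian_kde_1d(samples, grid, bandwidth=0.03)

# A density should integrate to roughly 1 over a range that covers the data.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 2))
```

A violin plot shows the same kind of density curve, mirrored around an axis and combined with box-plot style summary marks.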
Group the records by sex and compute the mean of each feature within each group:
    a = data.drop('Rings', axis=1).groupby('Sex').mean()
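The drop, groupby, mean chain can be seen on a tiny example. The table below is made up for illustration and has only one feature column:

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({
    'Sex':    ['M', 'F', 'I', 'M', 'F', 'I'],
    'Length': [0.50, 0.60, 0.30, 0.70, 0.40, 0.50],
    'Rings':  [10, 11, 6, 12, 9, 7],
})

# Drop the target column, then average every remaining feature per category.
means = toy.drop('Rings', axis=1).groupby('Sex').mean()
print(means)
```

The result has one row per category (sorted by label) and one column per remaining feature, which is the shape a grouped bar chart expects.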
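Draw a grouped bar chart:
    a.plot(kind='bar', grid=False)
    plt.title('Mean feature values by abalone sex')
    plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))

Building the regression model:
Import the required libraries (the code below uses the TensorFlow 1.x graph API, such as tf.placeholder and tf.Session):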
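    import tensorflow as tf
    import numpy as np
    import pandas as pd
    from sklearn.utils import shuffle
The Sex labels are categorical, so convert them to numeric values:
    df = data.copy()  # assumption: df is the DataFrame loaded above; the original uses df without defining it
    size_mapping = {'F': 0.1, 'M': 0.5, 'I': 0.9}
    df['Sex'] = df['Sex'].map(size_mapping)
Scale the data by dividing each feature column by its range (max minus min):
    data = np.array(df.values, dtype=np.float64)
    n = len(df.columns)
    for i in range(n-1):
        data[:, i] = data[:, i] / (data[:, i].max() - data[:, i].min())
Split the data into x (input features) and y (the value to predict):
    x_data = data[:, :n-1]
    y_data = data[:, -1]
Define placeholders for the feature data and the label data: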
    x = tf.placeholder(tf.float32, [None, n-1], name='x')
    y = tf.placeholder(tf.float32, [None, 1], name='y')
Define the model structure:
    with tf.name_scope("model"):
        w = tf.Variable(tf.random_normal([n-1, 1], stddev=0.01), name="w")
        b = tf.Variable(1.0, name="b")

        def model(x, w, b):
            return tf.matmul(x, w) + b

        pred = model(x, w, b)
Hyperparameters:
    train_epochs = 50
    learning_rate = 0.01
Define the mean squared error loss function:
    with tf.name_scope("LossFunction"):
        loss_function = tf.reduce_mean(tf.pow(y - pred, 2))
Create the gradient descent optimizer:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss_function)
    sess = tf.Session()
    init = tf.global_variables_initializer()
Record log files for later visualization in TensorBoard:
    logdir = r'C:\Users\yuzhu\Desktop\鮑魚數據集\log'
    sum_loss_op = tf.summary.scalar("loss", loss_function)
    merged = tf.summary.merge_all()
    sess.run(init)
Create the summary file writer (FileWriter):
    writer = tf.summary.FileWriter(logdir, sess.graph)
Train the model:
    loss_list = []   # average loss per epoch
    loss_list2 = []  # loss of every single training step
    for epoch in range(train_epochs):
        loss_sum = 0.0
        for xs, ys in zip(x_data, y_data):
            xs = xs.reshape(1, n-1)
            ys = ys.reshape(1, 1)
            _, summary_str, loss = sess.run([optimizer, sum_loss_op, loss_function], feed_dict={x: xs, y: ys})
            writer.add_summary(summary_str, epoch)
            loss_sum = loss_sum + loss
            loss_list2.append(loss)
        # Reshuffle so the next epoch visits the samples in a different order
        x_data, y_data = shuffle(x_data, y_data)
        b0temp = b.eval(session=sess)
        w0temp = w.eval(session=sess)
        loss_average = loss_sum / len(y_data)
        loss_list.append(loss_average)
        print("epoch=", epoch+1, "loss=", loss_average, "b=", b0temp, "w=", w0temp)
Plot how the loss changes:
    plt.plot(loss_list)

    plt.plot(loss_list2)
The loss curve can also be viewed in TensorBoard.
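For readers without a TensorFlow 1.x environment, the training procedure above can be sketched in plain NumPy. This is not the article's code: it runs on synthetic data generated in the snippet, and the gradients are written out by hand. It keeps the same structure, though: a linear model, squared-error loss, one gradient-descent update per sample, and a reshuffle every epoch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not the abalone dataset): y = 3*x0 - 2*x1 + 1 plus small noise.
x_data = rng.uniform(0.0, 1.0, size=(200, 2))
true_w = np.array([[3.0], [-2.0]])
y_data = x_data @ true_w + 1.0 + rng.normal(0.0, 0.01, size=(200, 1))

w = rng.normal(0.0, 0.01, size=(2, 1))  # weights, small random start
b = 1.0                                 # bias, same start value as in the article
learning_rate = 0.01
train_epochs = 50

loss_list = []  # average loss per epoch
for epoch in range(train_epochs):
    # Visit the samples in a new random order every epoch.
    order = rng.permutation(len(x_data))
    loss_sum = 0.0
    for i in order:
        xs = x_data[i:i + 1]   # shape (1, 2)
        ys = y_data[i:i + 1]   # shape (1, 1)
        pred = xs @ w + b      # linear model
        err = pred - ys        # shape (1, 1)
        loss_sum += float(err[0, 0] ** 2)
        # Gradients of (pred - y)^2 with respect to w and b.
        w -= learning_rate * 2.0 * (xs.T @ err)
        b -= learning_rate * 2.0 * float(err[0, 0])
    loss_list.append(loss_sum / len(x_data))

print(loss_list[0], loss_list[-1])
print(w.ravel(), b)
```

On this synthetic problem the average loss falls steadily and the learned w and b approach the values used to generate the data, which is the behaviour the loss plots above are meant to show.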