Credit Risk Management: Feature Scaling & Selection
Following feature engineering, this part moves on to the next step in the data preparation process: feature scaling and selection, which transforms the dataset into a more digestible one prior to modelling.
As a reminder, this end-to-end project aims to solve a classification problem in data science, particularly in the finance industry, and is divided into 3 parts. This article covers the 2nd part:
Feature Scaling and Selection (Bonus: Imbalanced Data Handling)
If you have missed the 1st part, feel free to check it out here before going through this 2nd part, for better context.
A. Feature Scaling
What is feature scaling and why do we need it prior to modelling?
According to Wikipedia,
Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
If you recall from the 1st part, we have completed engineering all of our features on both datasets (A & B) as below:
Dataset A (encoded without target)
Dataset B (encoded with target)

As seen above, the data range and distribution of the features are quite different from one another, not to mention that some variables carry outliers. That being said, it is highly recommended that we apply feature scaling consistently to the entire dataset, to make it more digestible to machine learning algorithms.
In fact, there are a number of different scaling methods available, but I will only focus on the three that I believe are relatively distinctive: StandardScaler, MinMaxScaler and RobustScaler. In brief:
StandardScaler: this method removes the mean and scales the data to unit variance (mean = 0 and standard deviation = 1). However, it is highly influenced by outliers, especially extreme ones, which can stretch the scaled data range well beyond 1 standard deviation.
MinMaxScaler: this method subtracts the minimum value of the feature and divides by the range (the difference between the original maximum and minimum values). Essentially, it rescales the dataset to the range 0 to 1. However, this method is relatively limited, as it compresses all data points into a narrow range and does not help much in the presence of outliers.
RobustScaler: this method is based on percentiles: it subtracts the median and divides by the interquartile range (75th minus 25th percentile). It is generally preferable to the other two scalers since it is not greatly influenced by large marginal outliers, if any.
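To make the three formulas concrete, here is a minimal sketch on a toy array with a single outlier (not the project data), applying each scaling rule by hand:

import numpy as np

# Toy feature with one large outlier, to show how each scaler reacts (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# StandardScaler: (x - mean) / std -- the outlier inflates both the mean and the std
standard = (x - x.mean()) / x.std()

# MinMaxScaler: (x - min) / (max - min) -- everything except the outlier is squeezed towards 0
min_max = (x - x.min()) / (x.max() - x.min())

# RobustScaler: (x - median) / IQR -- the median and IQR barely move with one outlier
q25, q75 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q75 - q25)

print(standard, min_max, robust, sep="\n")

Running it shows the bulk of the points being squeezed into a tiny band by StandardScaler and MinMaxScaler, while RobustScaler keeps them sensibly spread out.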
Let’s see how the three scalers differ in our dataset:
import pandas as pd
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# StandardScaler
x_a_train_ss = pd.DataFrame(StandardScaler().fit_transform(x_a_train), columns=x_a_train.columns)

# MinMaxScaler
x_a_train_mm = pd.DataFrame(MinMaxScaler().fit_transform(x_a_train), columns=x_a_train.columns)

# RobustScaler
x_a_train_rs = pd.DataFrame(RobustScaler().fit_transform(x_a_train), columns=x_a_train.columns)

Previews of the scaled training data: StandardScaler, MinMaxScaler, RobustScaler
As seen above, the scaled data range produced by RobustScaler looks more digestible than that of the other two scalers, which might help the machine learning algorithms run faster and more efficiently. This is just my assumption prior to modelling, but we can put it to the test when we get to that phase.
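If you want to eyeball the difference yourself rather than rely on the table previews, a rough sketch along these lines works (assuming x_a_train and the three scaled frames from the snippet above are still in memory; plotting only the first five columns is purely for readability):

import matplotlib.pyplot as plt

cols = x_a_train.columns[:5]  # a handful of features, just to keep the plot legible
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, (name, frame) in zip(axes, [("StandardScaler", x_a_train_ss),
                                    ("MinMaxScaler", x_a_train_mm),
                                    ("RobustScaler", x_a_train_rs)]):
    frame[cols].boxplot(ax=ax)  # compare the value ranges produced by each scaler
    ax.set_title(name)
    ax.tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()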
B. Imbalanced Data Handling
What is imbalanced data and how should we handle it?
In short, imbalanced datasets are those with a severe skew in the target class distribution, which might not be ideal for modelling. As a clearer example, let's see whether our dataset is imbalanced or not:
a_target_0 = df_a[df_a.target == 0].target.count() / df_a.target.count()
a_target_1 = df_a[df_a.target == 1].target.count() / df_a.target.count()
The result is that 76% of the data is classified as target 0, while the remaining 24% is classified as target 1.
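The same check can be done in one line (assuming the target column is named target, as above):

# Proportion of each class in the target column
print(df_a.target.value_counts(normalize=True))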
In fact, there’s no crystal clear benchmark that we should rely on to accurately determine whether our dataset is imbalanced or not. Some says that the ratio of 9:1 while others of 8:2, which indeed really depends on the nature of your dataset as well as the context/problem you are solving. In my case, I take the above result as imbalanced and will “resample” the dataset to make it relatively balanced.
實(shí)際上,沒(méi)有精確的基準(zhǔn)可用來(lái)準(zhǔn)確確定我們的數(shù)據(jù)集是否不平衡。 有人說(shuō)比例為9:1,而其他人則為8:2,這確實(shí)取決于數(shù)據(jù)集的性質(zhì)以及要解決的上下文/問(wèn)題。 就我而言,我將上述結(jié)果視為不平衡,并將對(duì)數(shù)據(jù)集進(jìn)行“重新采樣”以使其相對(duì)平衡。
Just a disclaimer: the pre-processing steps I take here are not all must-dos, nor will they all necessarily improve the accuracy of our model later on. Rather, my aim was to test every scenario that might help improve my models.
Back to resampling, there are two common methodologies that we might have heard of: over-sampling and under-sampling. In brief,
Over-sampling: this method duplicates samples from the minority class and adds them to the dataset (training set).
Under-sampling: this is the opposite of the above; it deletes some samples from the majority class.
Visually, you can imagine something like this:
Image credit: https://towardsdatascience.com/having-an-imbalanced-dataset-here-is-how-you-can-solve-it-1640568947eb

Let's test both out:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from collections import Counter

# Under-sampling
undersample = RandomUnderSampler()
x_a_train_rs_under, y_a_train_under = undersample.fit_resample(x_a_train_rs, y_a_train)
print(Counter(y_a_train_under))

# Over-sampling
oversample = SMOTE()
x_a_train_rs_over, y_a_train_over = oversample.fit_resample(x_a_train_rs, y_a_train)
print(Counter(y_a_train_over))
For each methodology, there is a variety of options to resample your dataset, but I've chosen the most common ones to execute: RandomUnderSampler (for under-sampling) and SMOTE (for over-sampling). The class distribution after resampling is:
- RandomUnderSampler: {0: 5814, 1: 5814}
- SMOTE: {1: 18324, 0: 18324}
Both have their pros and cons, but as the name suggests, RandomUnderSampler "randomly" selects data from the majority class to remove, which might result in information loss since not the entire dataset is taken into modelling. With that in mind, I've opted for SMOTE instead.
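For reference, SMOTE synthesises new minority-class rows by interpolating between existing minority neighbours rather than simply duplicating them. The defaults used above are fine, but the call can also be made explicit; a small sketch with illustrative (not tuned) parameter values:

from collections import Counter
from imblearn.over_sampling import SMOTE

# sampling_strategy=1.0 balances the classes 1:1, k_neighbors controls the interpolation
# neighbourhood, and random_state makes the resampling reproducible
oversample = SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=42)
x_a_train_rs_over, y_a_train_over = oversample.fit_resample(x_a_train_rs, y_a_train)
print(Counter(y_a_train_over))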
C. Feature Selection
What is feature selection and what options are available to us?
In short, feature selection is the process of selecting the variables in the dataset that have a strong correlation with, or impact on, the target variable. Particularly with bigger datasets, we might face hundreds of features, some of which might not be relevant, or even related, to the output. Hence, we need to conduct feature selection prior to modelling to achieve the highest accuracy.
Indeed, there are a handful of different techniques, which can be grouped under two big categories: (1) Feature Selection and (2) Dimensionality Reduction. The names probably sound familiar, and essentially they serve the same purpose, but the techniques used for each are relatively different.
1. Feature Selection
If you are looking for a complete list of techniques, feel free to refer to this blog article, which lists all the methods you could try. In this project, I will apply just two for the sake of simplicity: (a) Feature Importance and (b) Correlation Matrix.
For Feature Importance, as the name suggests, we will select the top features that matter most for predicting the target variable (e.g. the top 10 or 15, depending on the total number of features). In particular, this technique uses the feature_importances_ attribute of the ExtraTreesClassifier algorithm.
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

fi = ExtraTreesClassifier()
fi_a = fi.fit(x_a_train_rs_over, y_a_train_over)
df_fi_a = pd.DataFrame(fi_a.feature_importances_, index=x_a_train_rs_over.columns)

df_fi_a.nlargest(10, df_fi_a.columns).plot(kind='barh')
plt.show()

Top 10 features ranked by feature_importances_
As you can see, annual income is the most important feature, followed by age and years of employment. How many features to select is really up to you.
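If you want to act on this ranking rather than just plot it, a short sketch that keeps the top 15 columns (the cut-off of 15 is an arbitrary choice, not something fixed by the method):

# Keep the 15 features with the highest importance scores
top_features = df_fi_a.iloc[:, 0].nlargest(15).index
x_a_train_top = x_a_train_rs_over[top_features]
print(x_a_train_top.shape)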
Moving on to the 2nd method of Feature Selection, a Correlation Matrix is a table showing the correlation coefficients between the variables in the dataset. Essentially, the higher the number, the more correlated the two variables are.
Let’s see it more visually for better illustration:
import seaborn as sns

df_b_train_processed = pd.concat([x_b_train_rs_over, y_b_train_over], axis=1)  # combine processed features with their target
cm_b = df_b_train_processed.corr()
print(cm_b.target.sort_values().tail(10))

plt.figure(figsize=(20, 20))
sns.heatmap(cm_b, xticklabels=df_b_train_processed.columns, yticklabels=df_b_train_processed.columns, annot=True)

Correlation heatmap of df.corr() on dataset B
As seen above, we only need to take into consideration the last column in the table, which holds the correlation of each independent variable with the target. It looks like all features share similar coefficients with the target, given the similar colour shade. This might differ slightly from (1) Feature Importance, which we just ran above. However, there's no definite right or wrong answer, as each technique is designed and functions differently.
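A common follow-up, not done above but worth knowing, is to drop one feature from every pair of features that are highly correlated with each other, since such pairs carry largely redundant information. A hedged sketch with an arbitrary 0.9 threshold, assuming the processed feature frame x_b_train_rs_over is available:

import numpy as np

# Absolute correlations between the features only (the target column is excluded)
corr = x_b_train_rs_over.corr().abs()

# Keep only the upper triangle so that each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop one feature from every pair correlated above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
x_b_train_reduced = x_b_train_rs_over.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} features: {to_drop}")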
2. Dimensionality Reduction
Dimensionality Reduction is broadly similar to Feature Selection, but it has its own set of techniques. The common options I often use can be grouped as Component-based (PCA) and Projection-based (t-SNE and UMAP).
Component-based (PCA): as the name suggests, it is based on the original features in the dataset, which are transformed into a new, smaller set of components that retain most of the variance in the data.
Projection-based (t-SNE and UMAP): the mathematical concept behind this technique is more involved, but in short, it projects multi-dimensional data onto a lower-dimensional space, which helps reduce the number of features in the dataset.
Remember, feature scaling is required when using these techniques!
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(.95)
pca_a_train = pca.fit(x_a_train_rs_over, y_a_train_over)
print(pca_a_train.n_components_)

plt.plot(np.cumsum(pca_a_train.explained_variance_ratio_))
plt.show()

x_a_train_rs_over_pca = pd.DataFrame(pca_a_train.transform(x_a_train_rs_over))
x_a_train_rs_over_pca.head()
As for PCA, when calling it I set the explained variance to .95, which means I wanted a new set of components that retains 95% of the variance of the original features. In this case, after fitting the training data, PCA determines that only 24 components are needed to summarise the 46 original features. Furthermore, looking at the explained_variance_ratio_ chart, the curve flattens out after 24 components, which could be the ideal number of components once PCA is applied.
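One practical caveat worth adding: the PCA object should be fitted on the training set only and then reused to transform the hold-out features, so that both sides land in the same 24-component space. A minimal sketch, where x_a_test_rs stands in for the scaled test set (the name is illustrative):

import pandas as pd

# Re-use the PCA fitted on the scaled, resampled training data; never re-fit it on the test set
x_a_test_rs_pca = pd.DataFrame(pca_a_train.transform(x_a_test_rs))
print(x_a_test_rs_pca.shape)  # same number of rows as the test set, 24 components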
As for the projection-based methods, t-SNE works on large datasets but has proven limitations, notably long computation times and large-scale information loss, whereas UMAP performs better in terms of runtime while preserving more of the information.
In short, UMAP works by first calculating the distances between the points in high-dimensional space, projecting them onto a low-dimensional space, and calculating the distances between the points in this low-dimensional space. It then uses Stochastic Gradient Descent to minimize the difference between these distances.
import umap

um = umap.UMAP(n_components=24)
umap_a_train = um.fit_transform(x_a_train_rs_over)
x_a_train_rs_over_umap = pd.DataFrame(umap_a_train)
x_a_train_rs_over_umap.head()
Comparing PCA and UMAP in terms of performance, which one to use depends on the scale and complexity of the dataset. However, for the sake of simplicity and better runtime, I've opted to apply PCA across the datasets and leverage it in the modelling phase.
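If you want to verify the runtime argument on your own machine, a rough timing sketch like the one below will do (the numbers depend entirely on hardware and data size, and umap-learn must be installed):

import time
import umap
from sklearn.decomposition import PCA

# Rough wall-clock comparison on the resampled training set
start = time.perf_counter()
PCA(n_components=24).fit_transform(x_a_train_rs_over)
print(f"PCA:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
umap.UMAP(n_components=24).fit_transform(x_a_train_rs_over)
print(f"UMAP: {time.perf_counter() - start:.2f}s")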
Voila! We're done with the 2nd part of this end-to-end project, which concentrated on Feature Scaling and Selection!
I do hope you find it informative and easy to follow, so feel free to leave comments here! Do look out for the 3rd and final part of this project, which uses all of these data preparation steps to apply Machine Learning algorithms.
In the meantime, let’s connect:
GitHub: https://github.com/andrewnguyen07
LinkedIn: www.linkedin.com/in/andrewnguyen07
Thanks!
Source: https://towardsdatascience.com/credit-risk-management-feature-scaling-selection-b734049867ea