How I used machine learning to explore the differences between British and American literature

by Sofia Godovykh
As I delved deeper into English literature to improve my own language skills, my interest was piqued: how do American and British English differ?
With this question in mind, the next step was to apply natural language processing and machine learning techniques to find concrete examples. I was curious whether it would be possible to train a classifier that could distinguish the two kinds of literary texts.
It is quite easy to distinguish texts written in different languages, since the cardinality of the intersection of their vocabularies (the features, in machine learning terms) is relatively small. Text classification by category (such as science, atheism, computer graphics, and so on) is the well-known "hello world" of text classification tasks. I faced a more difficult task when I tried to compare two dialects of the same language, as the texts have no common theme.
The most time-consuming stage of machine learning is data retrieval. For the training sample I used texts from Project Gutenberg, which can be downloaded freely. As for the lists of American and British authors, I used names of authors I found on Wikipedia.
One of the challenges I encountered was matching the author name of a text to the corresponding Wikipedia page. The site implements a good search by name, but since it does not allow data to be parsed, I proposed instead to use the files containing metadata. This meant solving a non-trivial name-matching task (Sir Arthur Ignatius Conan Doyle and Doyle, C. are the same person, but Doyle, M.E. is a different one), and doing it with a very high level of accuracy.
Instead, I chose to sacrifice sample size for the sake of high accuracy, and to save some time. As a unique identifier I chose the author's Wikipedia link, which is included in some of the metadata files. With these files I was able to acquire about 1,600 British and 2,500 American texts and begin training my classifier.
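To give a flavor of that step, here is a minimal sketch of the grouping logic, assuming the metadata files have already been parsed into records. The field names `author_wiki_url` and `path` are hypothetical, since the real Project Gutenberg metadata is RDF that needs its own parsing:

```python
# Label each downloaded text as British or American by the author's
# Wikipedia URL. The record fields below are illustrative, not the
# actual Gutenberg metadata schema.
def label_texts(metadata_records, british_urls, american_urls):
    labeled = []
    for rec in metadata_records:
        url = rec.get("author_wiki_url")  # unique author identifier
        if url in british_urls:
            labeled.append((rec["path"], "british"))
        elif url in american_urls:
            labeled.append((rec["path"], "american"))
    return labeled  # list of (file path, label) pairs
```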
For this project I used the sklearn package. The first step after the data collection and analysis stage is pre-processing, for which I used a CountVectorizer. A CountVectorizer takes text data as input and returns a vector of features as output. Next, I needed to calculate tf-idf (term frequency times inverse document frequency). A brief explanation of why I needed it and how it works:
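In code, this preprocessing step looks roughly like the following sketch, where `texts` stands in for the list of raw document strings:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

texts = ["The quick brown fox...", "Another document..."]  # placeholder corpus

count_vect = CountVectorizer()
counts = count_vect.fit_transform(texts)  # sparse matrix of raw word counts

tfidf_transformer = TfidfTransformer()
features = tfidf_transformer.fit_transform(counts)  # counts re-weighted by tf-idf
```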
For example, take the word "the" and count its occurrences in a given text A. Suppose there are 100 occurrences, and the total number of words in the document is 1,000.
Thus,
tf(“the”) = 100/1000 = 0.1
Next, take the word "sepal", which occurs 50 times:
tf(“sepal”) = 50/1000 = 0.05
To calculate the inverse document frequency for these words, we take the logarithm of the ratio of the total number of texts to the number of texts containing at least one occurrence of the word. If there are 10,000 texts in total, and the word "the" occurs in every one of them:
idf(“the”) = log(10000/10000) = 0 and
tf-idf(“the”) = idf(“the”) * tf(“the”) = 0 * 0.1 = 0
The word "sepal" is far rarer, found in only 5 of the texts. Therefore:
idf("sepal") = log(10000/5) ≈ 7.6 and tf-idf("sepal") = 7.6 * 0.05 = 0.38
Thus, the most frequent words carry less weight, while specific, rarer ones carry more. If the word "sepal" occurs many times, we can assume the text is about botany. We cannot feed a classifier raw words, so we use tf-idf values as features instead.
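The worked example above can be checked in a few lines of Python (note that sklearn's TfidfTransformer uses a smoothed variant of this formula, so its numbers differ slightly):

```python
import math

tf_the, tf_sepal = 100 / 1000, 50 / 1000
idf_the = math.log(10000 / 10000)  # = 0.0
idf_sepal = math.log(10000 / 5)    # ≈ 7.6

print(tf_the * tf_the * 0 + tf_the * idf_the)  # tf-idf("the")   = 0.0
print(tf_sepal * idf_sepal)                    # tf-idf("sepal") ≈ 0.38
```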
After presenting the data as a set of features, I needed to train the classifier. Text data produces sparse feature matrices, so the best option is a linear classifier, which works well with large numbers of features.
First, I ran the CountVectorizer, TfidfTransformer, and SGDClassifier with default parameters. By analyzing the plot of accuracy against sample size, where accuracy fluctuated from 0.6 to 0.85, I discovered that the classifier was heavily dependent on the particular sample used, and therefore not very effective.
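A baseline of that shape can be assembled as follows. This is a sketch rather than the original code; `texts` and `labels` are assumed to hold the documents and their British/American labels from the data collection step:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# texts: list of document strings; labels: "british" or "american"
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2)

clf = Pipeline([
    ("vect", CountVectorizer()),    # words -> count features
    ("tfidf", TfidfTransformer()),  # counts -> tf-idf weights
    ("clf", SGDClassifier()),       # linear classifier for sparse data
])
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))    # accuracy on held-out texts
```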
After inspecting the list of classifier weights, I noticed part of the problem: the classifier had been fed words like "of" and "he", which should have been treated as noise. I could easily solve this problem by removing these words from the features via the CountVectorizer's stop_words parameter: stop_words = 'english' (or your own custom list of stop words).
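In code, that is a one-line change:

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(stop_words="english")            # built-in English list
vect = CountVectorizer(stop_words=["of", "he", "the"])  # or a custom list
```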
With the default stop words removed, I got an accuracy of 0.85. Then I launched an automatic parameter search with GridSearchCV and reached a final accuracy of 0.89. I might be able to improve this result with a larger training sample, but for now I stuck with this classifier.
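The search can look something like the sketch below; the grid itself is illustrative, not the one actually used:

```python
from sklearn.model_selection import GridSearchCV

param_grid = {  # hypothetical grid over the pipeline steps above
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-3, 1e-4, 1e-5],
}
search = GridSearchCV(clf, param_grid, cv=5)  # clf is the pipeline above
search.fit(X_train, y_train)
print(search.best_score_, search.best_params_)
```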
Now on to what interests me most: which words point to the origin of a text? Here are the words, sorted in descending order of their weight in the classifier (a sketch of how to read them off a fitted model follows the two lists):
American: dollars, new, york, girl, gray, american, carvel, color, city, ain, long, just, parlor, boston, honor, washington, home, labor, got, finally, maybe, hodder, forever, dorothy, dr
British: round, sir, lady, london, quite, mr, shall, lord, grey, dear, honour, having, philip, poor, pounds, scrooge, soames, things, sea, man, end, come, colour, illustration, english, learnt
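Such lists can be read off the fitted pipeline roughly like this. A sketch: for a binary SGDClassifier, coef_ has a single row, and the sign of each weight tells you which class the word points to:

```python
import numpy as np

# get_feature_names_out() on recent sklearn; get_feature_names() on older versions
words = np.array(clf.named_steps["vect"].get_feature_names_out())
weights = clf.named_steps["clf"].coef_[0]

order = np.argsort(weights)                # ascending by weight
print("one class:  ", words[order[:25]])   # most negative weights
print("other class:", words[order[-25:]])  # most positive weights
```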
While having fun with the classifier, I was able to single out the most "American" British authors and the most "British" American authors (a sneaky way to see how badly my classifier could do); one way to compute such a ranking is sketched below.
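One way to produce such rankings is to average the classifier's decision function over each author's texts and sort. In this sketch, `texts_by_author` is an assumed dict mapping an author's name to a list of their documents:

```python
# Average signed distance from the decision boundary per author.
scores = {
    author: clf.decision_function(docs).mean()
    for author, docs in texts_by_author.items()
}
# Authors whose average score leans toward the "wrong" class are the
# ones the classifier finds hardest to place.
for author, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{score:+.3f}  {author}")
```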
The most “British” Americans:
- Frances Hodgson Burnett (born in England, moved to the USA at age 17, so I treat her as an American writer)
- Henry James (born in the USA, moved to England at age 33)
- Owen Wister (yes, the father of Western fiction)
- Mary Roberts Rinehart (called the American Agatha Christie for a reason)
- William McFee (another writer who moved to America at a young age)
The most "American" Britons:
- Rudyard Kipling (he lived in America for several years, and he wrote "American Notes")
- Anthony Trollope (the author of "North America")
- Frederick Marryat (a veteran of the Anglo-American War of 1812, whose "Narrative of the Travels and Adventures of Monsieur Violet in California, Sonora, and Western Texas" landed him in the American category)
- Arnold Bennett (the author of "Your United States: Impressions of a First Visit"; one more gentleman who wrote travel notes)
- E. Phillips Oppenheim
And here are the most "British" British and the most "American" American authors (because the classifier still works well):
Americans:
- Francis Hopkinson Smith
- Hamlin Garland
- George Ade
- Charles Dudley Warner
- Mark Twain
British:
- George Meredith
- Samuel Richardson
- John Galsworthy
- Gilbert Keith Chesterton
- Anthony Trollope (oh, hi)
I was inspired to do this work by a @TragicAllyHere tweet.
Well, wourds really matter, as I realised.
Translated from: https://www.freecodecamp.org/news/how-to-differentiate-between-british-and-american-literature-being-a-machine-learning-engineer-ac842662da1c/