Wayne

Quick NLP Notes

Wayne posted @ Mon, 19 Mar 2012 14:56:07 +0000 in Notes , 3555 readers

There are surely better ways to keep notes, but I can't be bothered to set one up. I'll just jot things down here for now.

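Lemma, Wordform: A WORD in a sentence can be identified by two different criteria, lemma or wordform. The wordform is the surface shape of the word as it appears in the text, while the lemma is the underlying dictionary word it belongs to (same stem, same sense). For example, am, is and are share one lemma (be) but are three wordforms.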

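Type, Token: If we define a word by its wordform, the number of WORDs in a sentence can again be counted in two ways. Types count each distinct word once; tokens count every occurrence. So in a sentence like "no no no .... it is not possible" there are 5 types and 7 tokens.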
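The count above can be checked with a few lines of Python. This is only a sketch: it assumes a token is a maximal run of ASCII letters (so the dots are ignored), and the function name is made up for illustration.

```python
import re

def count_tokens_and_types(text):
    """Return (number of tokens, number of types), counting wordforms only."""
    # Assumption: a token is a maximal run of ASCII letters; punctuation is ignored.
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    # Types are the distinct wordforms, so a set is enough.
    types = set(tokens)
    return len(tokens), len(types)

n_tokens, n_types = count_tokens_and_types("no no no .... it is not possible")
print(n_tokens, n_types)  # 7 5
```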

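The number of tokens is usually written N, and the set of types (the vocabulary) is written V, so |V| is the size of the vocabulary. Church and Gale (1990) give an empirical relation, |V| > O(N^(1/2)): the vocabulary grows faster than the square root of the number of tokens.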

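The material also demonstrates a tokenization exercise here: counting the tokens in Shakespeare's works. The tr command looks pretty nice, although 依云 said the command combination given there is rather inefficient and not as good as an awk version. But when I asked him what the awk version looks like, he told me to go dig through the wiki myself...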

(note: I looked through the wiki and found an awk approach in the word-frequency section:

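# Count how often each word occurs, ignoring case.
BEGIN {
    # Treat any run of non-letters as the field separator.
    FS = "[^a-zA-Z]+"
}
{
    for (i = 1; i <= NF; i++) {
        # Skip the empty fields produced when a line starts or ends with a separator.
        if ($i != "")
            words[tolower($i)]++
    }
}
END {
    for (w in words)
        print w, words[w]
}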

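This is probably faster, though I haven't actually timed it. But the two approaches go in completely different directions: the awk script is more like a program written specifically to solve the word-frequency problem than something you would do offhand when faced with a pile of text. The tr approach, which combines pipes with several small tools, feels much more natural as a way of thinking.)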
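To make the "pipeline of small tools" idea concrete, here is a rough Python sketch in which each function plays the role of one stage in a pipe. The function names and the sample sentence are invented for illustration; this is not the command line from the material, only the same way of thinking.

```python
import re
from collections import Counter

def split_words(text):
    # Stage 1: treat every run of non-letters as a separator, one word per item.
    return [w for w in re.split(r"[^A-Za-z]+", text) if w]

def lowercase(words):
    # Stage 2: fold case so "The" and "the" are counted together.
    return [w.lower() for w in words]

def count(words):
    # Stage 3: count occurrences and list the most frequent words first.
    return Counter(words).most_common()

# Stages are chained the way commands are chained with pipes.
freqs = count(lowercase(split_words("The cat saw the other cat. The end.")))
print(freqs[:2])  # [('the', 3), ('cat', 2)]
```

Each stage can be tried and inspected on its own, which is what makes this style convenient for ad hoc exploration.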

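Simple regex-based tokenization still has plenty of problems: some words are hard to split, and some contain all sorts of symbols in the middle yet are still a single word. Not to mention Chinese word segmentation. For Chinese, the view here is that greedy matching (repeatedly taking the longest dictionary word that starts at the current position) works well, but the same method does not work for English.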
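A minimal sketch of greedy longest-match segmentation, to show why it tends to work for Chinese but not for English written without spaces. Both dictionaries are tiny hypothetical ones chosen for the demo, and unknown single characters are simply emitted as-is, which is an assumption of this sketch.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    longest = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word matches.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted as a fallback so the loop always advances.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical toy dictionaries, for illustration only.
zh_dict = {"中文", "分词", "很", "有效"}
print(max_match("中文分词很有效", zh_dict))      # ['中文', '分词', '很', '有效']

en_dict = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", en_dict))  # ['theta', 'bled', 'own', 'there']
```

The English input was meant to read "the table down there", but the greedy choice of "theta" at the start derails everything after it.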

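Normalization means unifying the various forms of a word according to meaning. In practice that amounts to finding the lemma; in other words, lemmatization is a good way to normalize. Convert am, is and are to be, and they are naturally normalized. Full lemmatization is complicated, but there are techniques that solve part of the problem. For example, for words built from one main part with different prefixes and suffixes attached, you can strip the prefixes and suffixes and keep the main part. The main part is called the stem, and the attached pieces are called affixes. For English there is Porter's algorithm, which strips common suffixes with a series of rewrite rules. Unfortunately the approach does not generalize (the rules are specific to English), and exceptions are easy to find.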
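To give a feel for rule-based suffix stripping, here is a toy stemmer. It is not Porter's algorithm: the rule list, the minimum stem length and the function name are all assumptions made up for this sketch, and it only shows the general shape of the idea along with how quickly exceptions appear.

```python
# Hypothetical rule list: (suffix, replacement), checked in order; first match wins.
SUFFIX_RULES = [
    ("sses", "ss"),   # caresses -> caress
    ("ies", "i"),     # ponies   -> poni
    ("ing", ""),      # walking  -> walk
    ("ed", ""),       # walked   -> walk
    ("s", ""),        # cats     -> cat
]

def toy_stem(word):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least two characters of stem so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "walking", "walked", "cats", "sing", "bring"]:
    print(w, "->", toy_stem(w))
```

Here "sing" survives only because of the minimum-length guard, while "bring" is wrongly cut down to "br". Every such rule needs further conditions to handle its exceptions.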

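Sentence segmentation means splitting text into sentences. Some common methods are covered here, though I think they only suit formal English text. They mainly rely on punctuation and capitalization, combined through a "decision tree". A decision tree is a bunch of nested if ... else statements, which at a glance doesn't strike me as efficient, so plenty of other methods are used in practice too. The quality of a decision tree depends directly on its decision conditions, that is, its features. A few numeric features that look fairly sophisticated are worth remembering: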

Length of word with "."

Probability(word with "." occurs at end-of-s)

Probability(word after "." occurs at beginning-of-s)
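A hand-written decision tree using these features might look like the sketch below. The thresholds, the order of the tests and the two probability tables are all invented for illustration; in a real system the probabilities would be estimated from a corpus.

```python
def is_sentence_end(word, next_word, p_end, p_start):
    """Hand-written decision tree: does the "." ending `word` close a sentence?

    p_end   -- hypothetical table: P(word with "." occurs at end of sentence)
    p_start -- hypothetical table: P(word after "." occurs at start of sentence)
    """
    if not word.endswith("."):
        return False
    if not next_word:
        return True                        # nothing follows: end of text
    if not next_word[0].isupper():
        return False                       # lowercase continuation, likely an abbreviation
    if len(word) <= 3:                     # short word with ".", e.g. "Dr."
        return p_end.get(word, 0.5) > 0.5
    return p_start.get(next_word, 0.5) >= 0.5

# Made-up probabilities, for illustration only.
p_end = {"Dr.": 0.01, "it.": 0.9}
p_start = {"The": 0.8, "Smith": 0.1}

print(is_sentence_end("Dr.", "Smith", p_end, p_start))      # False
print(is_sentence_end("it.", "The", p_end, p_start))        # True
print(is_sentence_end("possible.", "The", p_end, p_start))  # True
```

The nested tests are easy to read, but every threshold is a guess, which is one reason the features end up mattering more than the tree itself.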

