Counting element frequencies in a sequence with Python

1, The most straightforward approach

In [1]: from random import randint

In [2]: data = [randint(0, 7) for _ in range(20)]              # generate a random list that contains duplicates

In [3]: data
Out[3]: [0, 5, 2, 3, 3, 1, 0, 4, 2, 4, 6, 6, 0, 7, 2, 0, 4, 3, 2, 7]

In [4]: d = dict.fromkeys(data, 0)                            # dict.fromkeys() returns a new dict whose keys are the elements of data, each with the value 0

In [5]: d
Out[5]: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0}      # duplicate elements have collapsed into single keys

In [6]: for i in data:                                        # iterate over the elements of the list data
   ...:     d[i] += 1                                         # add 1 to the element's count each time it appears
   ...: 

In [7]: d
Out[7]: {0: 4, 1: 1, 2: 4, 3: 3, 4: 3, 5: 1, 6: 2, 7: 2}      # the resulting element frequencies

In [8]: sorted(((v, k) for k, v in d.items()), reverse=True)  # sort (count, element) pairs by frequency, highest first; ties go to the larger element
Out[8]: [(4, 2), (4, 0), (3, 4), (3, 3), (2, 7), (2, 6), (1, 5), (1, 1)]

In [9]: sorted(((v, k) for k, v in d.items()), reverse=True)[:3]
Out[9]: [(4, 2), (4, 0), (3, 4)]                              # the three most frequent elements, as (count, element) pairs
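
The loop-and-sort steps above can also be wrapped in a pair of small functions. This is only a minimal sketch: the names `count_frequency` and `top_n` are hypothetical helpers introduced here for illustration, and the sketch uses `dict.get` instead of `dict.fromkeys` so that counts are initialised as elements are encountered.

```python
def count_frequency(items):
    """Return a dict mapping each element to the number of times it occurs."""
    freq = {}
    for item in items:
        # dict.get returns 0 the first time an element is seen
        freq[item] = freq.get(item, 0) + 1
    return freq


def top_n(freq, n):
    """Return the n (element, count) pairs with the highest counts."""
    # Sort on the count only; sorted() is stable, so ties keep dict order
    return sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Same list as Out[3] in the session above, hard-coded for reproducibility
data = [0, 5, 2, 3, 3, 1, 0, 4, 2, 4, 6, 6, 0, 7, 2, 0, 4, 3, 2, 7]
freq = count_frequency(data)
print(freq)
print(top_n(freq, 3))
```

Because this sketch sorts on the count alone, elements with equal counts stay in first-seen order, whereas the session above sorts (count, element) tuples and therefore puts the larger element first.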

2, Using the Counter class from the standard library's collections module

In [1]: from random import randint

In [2]: from collections import Counter

In [3]: data = [randint(0, 7) for _ in range(15)]

In [4]: data
Out[4]: [1, 7, 7, 0, 2, 4, 4, 5, 0, 6, 1, 3, 1, 7, 7]

In [5]: Counter(data)                    # passing the list to Counter counts the frequency of every element directly
Out[5]: Counter({7: 4, 1: 3, 0: 2, 4: 2, 2: 1, 3: 1, 5: 1, 6: 1})

In [6]: Counter(data).most_common(3)    # the three most frequent elements, as (element, count) pairs
Out[6]: [(7, 4), (1, 3), (0, 2)]
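
Since `Counter` is a `dict` subclass, the result can be queried and extended like a dictionary. The snippet below is a small sketch of that behaviour using the list from Out[4]; the extra values passed to `update()` are made up for illustration.

```python
from collections import Counter

# Same list as Out[4] in the session above
data = [1, 7, 7, 0, 2, 4, 4, 5, 0, 6, 1, 3, 1, 7, 7]
counts = Counter(data)
top3 = counts.most_common(3)
print(top3)

# Ordinary dict-style lookup returns the count of an element
print(counts[7])

# A missing element reports a count of 0 instead of raising KeyError
print(counts[99])

# update() adds to the existing counts rather than replacing them
counts.update([7, 7, 2])  # illustrative extra data, not from the session
print(counts.most_common(2))
```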

3, Finding the 10 most frequent words in a text

In [1]: import re                           # import re, the regular expression module

In [2]: from collections import Counter

In [3]: f = open('pindu.txt', 'r').read()   # open a file containing many words in read-only mode and read its whole content into the string f

In [4]: f
Out[4]: 'It\'s faster horses,\nYounger women,\nOlder whiskey and\nMore money.\n\t\t--Tom T. Hall, "The Secret of Life"\nI\'d love to kiss you, but I just washed my hair.\n\t\t-- Bette Davis, "Cabin in the Cotton"\nThe things that interest people most are usually none of their business.\nBlack Holes:\n\tAn X generation subgroup best known for their possession of\nalmost entirely black wardrobes.\n\t\t-- Douglas Coupland, "Generation X: Tales for an Accelerated\n\t\t   Culture"\nVendor no longer supports the product\n"Interesting survey in the current Journal of Abnormal Psychology: New York \nCity has a higher percentage of people you shouldn\'t make any sudden moves \naround than any other city in the world."\n-- David Letterman\nISO applications:\n\tA solution in search of a problem!\nNothing is ever a total loss; it can always serve as a bad example.\nWhat this country needs is a good five cent microcomputer.\nYou will be the victim of a bizarre joke.\n"Turn on, tune up, rock out."\n-- Billy Gibbons\nI had a feeling once about mathematics -- that I saw it all.  Depth beyond\ndepth was revealed to me -- the Byss and the Abyss. I saw -- as one might\nsee the transit of Venus or even the Lord Mayor\'s Show -- a quantity passing\nthrough infinity and changing its sign from plus to minus.  I saw exactly\nwhy it happened and why tergiversation was inevitable -- but it was after\ndinner and I let it go.\n\t\t-- Winston Churchill\n"90% of everything is crap", Its called Sturgeon\'s law 8)                     \nOne of the problems is indeed finding the good bits\n\n\t- Alan Cox\n"What man has done, man can aspire to do."\n-- Jerry Pournelle, about space flight\nIn a gathering of two or more people, when a lighted cigarette is\nplaced in an ashtray, the smoke will waft into the face of the non-smoker.\nIt is impossible for an optimist to be pleasantly surprised.\nAnd the crowd was stilled.  
One elderly man, wondering at the sudden silence,\nturned tothe Child and asked him to repeat what he had said.  Wide-eyed,\nthe Child raised hisvoice and said once again, "Why, the Emperor has no\nclothes!  He is naked!"\n- "The Emperor\'s New Clothes"\n'

In [5]: word_list = re.split(r'\W+', f)     # split into words with re.split; r'\W+' matches one or more non-word characters, i.e. anything other than letters, digits, Chinese characters, and _

In [6]: Counter(word_list).most_common(10)  # the 10 most frequent words
Out[6]:
[('the', 19),
 ('of', 12),
 ('a', 10),
 ('to', 7),
 ('is', 7),
 ('and', 7),
 ('I', 7),
 ('it', 5),
 ('in', 5),
 ('s', 4)]
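
The result above is case-sensitive, and splitting on `\W+` breaks contractions such as "It's" apart, which is where the `'s'` entry comes from. One possible refinement is sketched below. It assumes that words should be compared in lowercase and that an apostrophe inside a word belongs to it. The helper name `top_words` and the regular expression are choices made for this illustration, and the sample string reuses a few lines from the file shown above so that the snippet runs without `pindu.txt`.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most common words in text, ignoring case."""
    # The optional group keeps contractions such as "it's" as one word
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words).most_common(n)


# A few lines taken from the file content shown in Out[4]
sample = (
    "It's faster horses,\nYounger women,\nOlder whiskey and\nMore money.\n"
    "The things that interest people most are usually none of their business.\n"
    "Nothing is ever a total loss; it can always serve as a bad example.\n"
)
print(top_words(sample, 3))
```

To apply the same helper to a real file, the text could be read with `with open('pindu.txt') as fh: text = fh.read()`, which also closes the file once reading is done.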