
Some issues with the code #27

Open
wjy3326 opened this issue Apr 27, 2022 · 6 comments

Comments

@wjy3326 commented Apr 27, 2022

def get_local_word2entity(entities):  # format: entity_id:entity
    """
    Given the entities information in one line of the dataset, construct a map from word to entity index
    E.g., given entities = 'id_1:Harry Potter;id_2:England', return a map = {'harry': index_of(id_1),
    'potter': index_of(id_1), 'england': index_of(id_2)}
    :param entities: entities information in one line of the dataset
    :return: a local map from word to entity index
    """
    local_map = {}

    for entity_pair in entities.split(';'):
        entity_id = entity_pair[:entity_pair.index(':')]
        entity_name = entity_pair[entity_pair.index(':') + 1:]

        # remove non-character words and transform words to lower case
        entity_name = PATTERN1.sub(' ', entity_name)
        entity_name = PATTERN2.sub(' ', entity_name).lower()
        # construct map: word -> entity_index
        for w in entity_name.split(' '):
            entity_index = entity2index[entity_id]
            local_map[w] = entity_index  # Problem here: if different entities contain the same word, this overwrites the earlier entity_index. What is the intent?

    return local_map
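For concreteness, a minimal runnable reproduction of the overwrite; `PATTERN1`, `PATTERN2`, and `entity2index` below are hypothetical stand-ins for the module-level globals used in the function:

```python
import re

# Hypothetical stand-ins for the module-level globals
PATTERN1 = re.compile('[^A-Za-z]')  # non-letter characters
PATTERN2 = re.compile('[ ]{2,}')    # runs of multiple spaces
entity2index = {'id_1': 1, 'id_2': 2}

def get_local_word2entity(entities):
    local_map = {}
    for entity_pair in entities.split(';'):
        entity_id = entity_pair[:entity_pair.index(':')]
        entity_name = entity_pair[entity_pair.index(':') + 1:]
        entity_name = PATTERN1.sub(' ', entity_name)
        entity_name = PATTERN2.sub(' ', entity_name).lower()
        for w in entity_name.split(' '):
            local_map[w] = entity2index[entity_id]
    return local_map

# Two entities sharing the word 'new': the second mapping overwrites the first
print(get_local_word2entity('id_1:New York;id_2:New Jersey'))
# {'new': 2, 'york': 1, 'jersey': 2}
```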
@hwwang55 (Owner) commented

Hello, the possibility you describe does exist, but in practice it almost never happens. Thanks!

@wjy3326 (Author) commented Apr 27, 2022

I think if the number of entities is large enough, this can happen quite often. After all, there are only a few thousand common words, but there are many entities. What is this function actually used for?

@hwwang55 (Owner) commented

This computation is done per news title, so it is very unlikely that, within a single title, the same word appears in multiple entities. The function determines which entity each word belongs to, for the subsequent convolution operation.
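To illustrate the intended use, a hypothetical sketch (not the repository's actual code): the local map assigns an entity index to each word of a title, with 0 for words that belong to no entity:

```python
def title_word_entities(title_words, local_map):
    # Map each (lowercased) title word to its entity index; 0 means "no entity"
    return [local_map.get(w, 0) for w in title_words]

local_map = {'harry': 1, 'potter': 1, 'england': 2}
print(title_word_entities(['harry', 'potter', 'visits', 'england'], local_map))
# [1, 1, 0, 2]
```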

@wjy3326 (Author) commented Apr 27, 2022

Understood, thanks.

@wjy3326 (Author) commented Apr 27, 2022

Two more questions:

1. In the part where word_embedding and entity_embedding are concatenated: since entity_embedding is for the whole entity (a multi-word phrase) and the concatenation is done per word, do all the words of one entity get the same entity_embedding? For example, with entity ids (0, 0, 0, 0, 3533, 3533, 3533, 0, 0), the three words with id 3533 all share the same entity_embedding, right?
2. Why does training use a sigmoid at the end rather than a softmax?
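A minimal sketch of the concatenation being asked about, with hypothetical table sizes and ids (not the repository's code): positions that share an entity id get identical entity-embedding halves:

```python
import numpy as np

rng = np.random.default_rng(0)
word_emb = rng.normal(size=(10000, 50))   # hypothetical word embedding table
entity_emb = rng.normal(size=(5000, 50))  # hypothetical entity embedding table

word_ids = [12, 7, 99, 3, 41]
entity_ids = [0, 3533, 3533, 3533, 0]     # repeated id -> identical entity rows

# Per-word concatenation: each position becomes [word_vec ; entity_vec]
x = np.concatenate([word_emb[word_ids], entity_emb[entity_ids]], axis=1)
print(x.shape)  # (5, 100)
# Positions 1, 2, 3 share entity id 3533, so their entity halves are equal
print(np.array_equal(x[1, 50:], x[2, 50:]))  # True
```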

@wjy3326 (Author) commented Apr 27, 2022

1. What threshold is set for the sigmoid? Is it 0.5?
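For context, a common convention (an assumption; the repository may differ) is to predict positive when the sigmoid output exceeds 0.5, which is equivalent to the raw logit being positive:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# With a 0.5 threshold, sigmoid(z) > 0.5 is the same condition as z > 0
print(sigmoid(0.0))         # 0.5
print(sigmoid(1.3) > 0.5)   # True
print(sigmoid(-0.2) > 0.5)  # False
```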
