Final project for the Information Retrieval course.
TPNT is a tag cloud generator that extracts hot keywords from the Twitter page of a major Persian news agency, in the fields of economics and social affairs, for each month of a year.
GetOldTweets-java v1.2.0
Lucene 7.2.1
- Tasnim News(@TasnimNews_Fa)
This project has two main steps. First, tweets are stored in a CSV file with the help of the Crawler class. This class needs some options to work properly:
Flag | Description | Required |
---|---|---|
-i | ID of the Twitter page | yes |
-s | Start date of extraction, format: YYYY-MM-DD | yes |
-e | End date of extraction, format: YYYY-MM-DD | no |
-m | Maximum number of tweets to retrieve | no |
-p | Path of the CSV file | no |
-n | Name of the CSV file | no |
An example that retrieves tweets from @TasnimNews_Fa between 2018-06-01 and 2018-07-01 and stores the CSV file under $PWD/result/:
java -cp ProjectNews.jar ir.ac.um.ce.projectnews.crawler.Crawler -i Tasnimnews_Fa -s 2018-06-01 -e 2018-07-01 -p result/
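The option rules in the table (only -i and -s are required, dates use YYYY-MM-DD) can be illustrated with a minimal sketch. This is not the project's actual Crawler code: the class name `CrawlerOptionsSketch` and the `parse` helper are hypothetical, and the sketch assumes every flag is followed by exactly one value.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

public class CrawlerOptionsSketch {

    // Hypothetical helper: turns "-flag value" pairs into a map and
    // enforces the rules from the options table.
    static Map<String, String> parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Every flag needs a value");
        }
        Map<String, String> options = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("Expected a flag, got: " + args[i]);
            }
            options.put(args[i].substring(1), args[i + 1]);
        }
        // Only -i (page ID) and -s (start date) are required.
        for (String required : new String[] {"i", "s"}) {
            if (!options.containsKey(required)) {
                throw new IllegalArgumentException("Missing required option -" + required);
            }
        }
        // LocalDate.parse accepts ISO dates (YYYY-MM-DD) and throws
        // DateTimeParseException for anything else.
        LocalDate.parse(options.get("s"));
        if (options.containsKey("e")) {
            LocalDate.parse(options.get("e"));
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> options = parse(new String[] {
            "-i", "Tasnimnews_Fa", "-s", "2018-06-01", "-e", "2018-07-01", "-p", "result/"
        });
        System.out.println("Page " + options.get("i") + ", start " + options.get("s"));
    }
}
```

Validating the dates up front means a malformed value such as 218-06-01 fails before any tweets are requested.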
The next step is indexing the documents. After removing stop words from the documents, we use the Searcher and Classifier classes, plus a bag of words, to build queries that estimate how strongly each document correlates with the context. Finally, we use the most correlated words to generate a tag cloud.
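The idea of this step can be shown with a simplified, Lucene-free sketch. It is an assumption-laden stand-in, not the project's Searcher or Classifier: a document counts as on-topic if it contains at least one word from the bag of words, and the "most correlated" words are approximated by raw term frequency over the on-topic documents. The class `TagCloudSketch`, the method `topTerms`, and the English sample data are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TagCloudSketch {

    // Returns the k most frequent non-stop-word terms among documents
    // that share at least one term with the bag of words.
    static List<String> topTerms(List<String> docs, Set<String> stopWords,
                                 Set<String> bagOfWords, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            List<String> tokens = Arrays.stream(doc.toLowerCase(Locale.ROOT).split("\\s+"))
                    .filter(t -> !t.isEmpty() && !stopWords.contains(t))
                    .collect(Collectors.toList());
            // Crude relevance test: skip documents with no bag-of-words hit.
            if (tokens.stream().noneMatch(bagOfWords::contains)) {
                continue;
            }
            for (String token : tokens) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // Sort by frequency (descending), breaking ties alphabetically.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "bank raises interest rate",
                "the bank cuts rate again",
                "football match tonight");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "again"));
        Set<String> economics = new HashSet<>(Arrays.asList("bank", "rate", "inflation"));
        System.out.println(topTerms(docs, stopWords, economics, 2));
    }
}
```

In the real pipeline, a Lucene index and scored queries would replace the simple containment test, and the resulting term weights would drive the font sizes in the tag cloud.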