National Polytechnic
Institute
Demo 3.1
User Manual
Grigori Sidorov
Mexico City
1999
Contents
1. Welcome to Classifier
1.1 The main screen
1.2 How Do I...
2. Screen Elements
2.1 Document page
2.1.1 Files list (Document page)
2.1.2 Topic tree (Document page)
2.1.3 Results sheet (Document page)
2.2 By Topic page
2.2.1 Topic tree (By Topic page)
2.2.2 Results sheet (By Topic page)
2.3 By Document page
2.3.1 Files list (By Document page)
2.3.2 Topic tree (By Document page)
2.3.3 Results sheet (By Document page)
2.4 Dictionary page
2.4.1 Topic tree (Dictionary page)
2.4.2 Words list (Dictionary page)
2.4.3 Language list (Dictionary page)
2.5 About page
2.6 How to view the histogram
3. Toolbar
3.1 Open button
3.2 Options button
3.3 Font button
3.4 Ordering button
3.5 Statistics button
3.6 Zoom Diagram button
3.7 Search button
3.8 Help button
4. Options
4.1 Languages
4.2 Hide Processing checkbox
4.3 Show Words for Topic checkbox
4.4 Convert Dictionaries to ANSI checkbox
4.5 Convert Texts to ANSI checkbox
4.6 Scaling
5. Development team
6. Bibliography on Classifier
The Classifier Demo program allows you to investigate individual documents and sets of documents:
· To determine the topics mentioned in the document,
· To determine which documents are similar to the chosen one with respect to the topics that they mention,
· To choose the thematic aspect on the basis of which the documents are compared,
· To get explanations of how the program calculates the results, and
· To investigate the program’s dictionary.
Please see the following topics:
· How Do I...
· Screen Elements
· Development team
This is a sample screen of the program:
The screen shows a chart of the principal topics in a Spanish document.
· To open a set of documents to classify, use the Open button.
· To view the thematic structure of an open document, switch to the Document page and select this document in the Files list. Then switch to the desired tab on the Results sheet.
· To determine which documents are most relevant for a specific topic, switch to the By Topic page and choose the topic in the Topic tree, then switch to the desired tab on the Results sheet. Hint: to find the desired topic, switch to the Dictionary page and use the Search button, then switch back to the By Topic page.
· To determine which documents are closest in their thematic structure to a given one, switch to the By Document page, choose the document in the Files list, then switch to the desired tab on the Results sheet. You can also choose the thematic aspect in the Topic tree.
On the Classifier Demo screen, you can choose one of the following five pages by clicking on the tabs at the top of the screen:
· Document page
· By Topic page
· By Document page
· Dictionary page
· About page
The Classifier Demo screen also provides the following elements:
· Toolbar
· Options
The Document page allows you to investigate an individual document. It is divided into the following three areas:
· Files list
· Topic tree
· Results sheet
The Files list allows you to choose the document (by its file name) that you want to investigate. All the information on the Results sheet in the right part of the screen is presented with respect to the chosen document.
The list shows the names of the currently loaded files along with the automatically detected language of each document. If the language is Default, the document is assumed to be in the language chosen with the Languages radio buttons under the Options button on the toolbar.
Double-clicking an element of the list switches to the By Document page and chooses that document as the base for comparison.
The Topic tree allows you to choose a topic or a branch in the hierarchy of topics with respect to which the document will be investigated. Only the chosen topics are considered when determining the thematic profile of the document shown on the Results sheet.
The Results sheet, in the right-side area of the Document page, shows the thematic profile of the document. It allows you to view the data in one of the following modes:
· Text tab
· Chart tab
· Pie tab
· List tab
The Text tab shows the full text of the document in a text box.
To find a word in the text, use the Search button on the toolbar.
On the Chart tab, the histogram shows the most important topics mentioned in the document, arranged in order of importance. The more times a topic is mentioned in the document, the more important it is for this document.
There are two ways of calculating the importance of a topic; you switch between them with the Statistics button on the toolbar.
See also How to view the histogram.
The Pie tab shows the same results as the Chart tab in the form of a pie diagram.
To view the pie diagram, we recommend that you zoom the diagram with the Zoom Diagram button on the toolbar.
The List tab shows the topics found in the document and, under each topic, the words voting for this topic. The words are in the language of the document, while the names of the topics are in the language of the dictionary (normally English).
Next to each word, the number of its occurrences or its relative weight in the document is shown; next to each topic, the total number of occurrences or the accumulated weight of the words voting for this topic is shown.
The order of topics depends on the Ordering button on the toolbar. Whether the words for each topic are shown depends on the Show Words for Topic checkbox under the Options button.
To jump to a desired topic or word, we recommend that you use the Search button.
The By Topic page allows you to investigate the set of documents by topic. It consists of the following two areas:
· Topic tree
· Results sheet
The Topic tree shows the topic hierarchy. When you choose a topic or a branch in the hierarchy, the Results sheet in the right part of the screen shows the statistics for the selected topic(s).
This is the same tree as all the other Topic tree views in the program. Changing the selection in this tree also changes the selection on all the other pages; likewise, changes on the other pages affect the tree on this page.
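The difference between choosing a single topic and choosing a branch can be sketched in Python. This is only an illustration: the nested-dictionary layout, the topic names, and the function names are hypothetical and are not taken from the program or its dictionary.

```python
# Hypothetical topic hierarchy (an assumption for illustration):
# non-empty dicts are branches, empty dicts are terminal topics.
TOPIC_TREE = {
    "science": {
        "biology": {"animals": {}, "plants": {}},
        "physics": {},
    },
    "military": {"war": {}},
}


def find_node(tree, name):
    """Return the subtree stored under `name` (depth-first search), or None."""
    for key, subtree in tree.items():
        if key == name:
            return subtree
        found = find_node(subtree, name)
        if found is not None:
            return found
    return None


def selected_topics(tree, name):
    """Terminal topics covered by selecting `name` (a topic or a whole branch)."""
    node = find_node(tree, name)
    if node is None:
        return set()          # unknown name: nothing is selected
    if not node:
        return {name}         # terminal topic: it selects only itself
    leaves = set()
    for child in node:
        leaves |= selected_topics(node, child)
    return leaves
```

In this sketch, selecting the branch "biology" covers the terminal topics "animals" and "plants", while selecting "war" covers only itself.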
To find a particular topic, we recommend that you switch to the Dictionary page and use the Search button there, then switch back to the By Topic page.
The Results sheet, in the right-side area of the By Topic page, shows the weight of the topic selected in the Topic tree for each of the documents. It allows you to view the data in one of the following modes:
· Chart tab
· Pie tab
· List tab
On the Chart tab, the histogram shows the weight of the selected topic in each document of the set. The more times the topic is mentioned in a document, the more relevant this document is for the topic.
Depending on the state of the Ordering button on the toolbar, the chart can be ordered in one of two ways:
· By relevance,
· Alphabetically, by file name.
If the chart is ordered by relevance, the order of the documents depends on how their weights are counted:
· With absolute weights, larger documents usually go before smaller ones, since they contain more words in total on the selected topic.
· With relative weights, the documents with a higher concentration (percentage) of words voting for the selected topic go first.
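The orderings described above can be illustrated with a short Python sketch. The file names, the counts, and the function name are hypothetical; the sketch only mirrors the rules stated in this section and is not the program's own code.

```python
# Hypothetical data for one selected topic (an assumption for illustration):
# (file name, words voting for the topic, total words in the document)
DOCS = [
    ("long.txt", 40, 4000),
    ("short.txt", 10, 200),
    ("medium.txt", 20, 1000),
]


def order_documents(docs, by_relevance=True, absolute=True):
    """Return the file names in the order in which the chart would list them."""
    if not by_relevance:
        return sorted(name for name, _, _ in docs)  # alphabetical by file name

    def weight(doc):
        _, hits, total = doc
        if absolute:
            return hits                              # total voting words
        return 100.0 * hits / total if total else 0.0  # percentage of all words

    return [name for name, _, _ in sorted(docs, key=weight, reverse=True)]
```

With these sample numbers, absolute weights put long.txt first (40 voting words), while relative weights put short.txt first (5 percent of its words vote for the topic).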
You switch between these two modes with the Statistics button on the toolbar.
See also How to view the histogram.
The Pie tab shows the same results as the Chart tab in the form of a pie diagram.
To view the pie diagram, we recommend that you zoom the diagram with the Zoom Diagram button on the toolbar.
The List tab shows in textual form the same results as the Chart tab. For each file, the absolute and relative weight of the chosen topic in the text of the document is shown.
The order of elements depends on the state of the Ordering button on the toolbar; see the explanations for the Chart tab.
Double-clicking a document switches to the Document page and activates the chosen document, so that you can see its text or thematic structure.
The By Document page allows you to search for documents similar in their thematic structure to a chosen one; the aspect of similarity depends on the selected topic. The page consists of the following three areas:
· Files list
· Topic tree
· Results sheet
The Files list allows you to choose the document with which all the other documents will be compared. The results of the comparison are shown on the Results sheet in the right part of the screen.
The Topic tree allows you to choose the thematic aspect of the comparison. You can choose a topic of interest or a branch in the topic hierarchy. To calculate the measure of similarity between two documents, only words relevant to the chosen topic(s) are taken into account.
The aspect is important for the comparison of documents. For example, let the base document be devoted
· (0) to the use of animals in war,
and let the other two documents be devoted
· (1) to the use of animals in the circus, and
· (2) to the use of electronic devices in war.
Then, from the aspect of biology, document (1) is similar to the base document while (2) is not; on the other hand, from the point of view of military science, document (2) is similar to the base document while (1) is not.
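The example above can be reproduced with a small Python sketch. Note the assumptions: this manual does not state which similarity measure the program uses, so cosine similarity stands in for it here, and the topic names and weights are invented for illustration.

```python
import math

# Hypothetical topic weights for the three documents of the example.
BASE = {"animals": 5, "war": 6}          # (0) animals in war
CIRCUS = {"animals": 7, "circus": 4}     # (1) animals in the circus
DEVICES = {"electronics": 6, "war": 5}   # (2) electronic devices in war

# Hypothetical aspects: the topics that remain selected in the Topic tree.
BIOLOGY = ["animals", "plants"]
MILITARY = ["war", "weapons"]


def similarity(doc_a, doc_b, aspect):
    """Cosine similarity of two topic profiles, using only topics in `aspect`."""
    a = [doc_a.get(topic, 0) for topic in aspect]
    b = [doc_b.get(topic, 0) for topic in aspect]
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0  # at least one document says nothing on this aspect
    return sum(x * y for x, y in zip(a, b)) / norm
```

With these weights, document (1) scores higher than document (2) under BIOLOGY, and the reverse holds under MILITARY, because each comparison ignores the topics outside the chosen aspect.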
This is the same tree as all the other Topic tree views in the program. Changing the selection in this tree also changes the selection on all the other pages; likewise, changes on the other pages affect the tree on this page.
To find a particular topic, we recommend that you switch to the Dictionary page and use the Search button there, then switch back to the By Document page.
The Results sheet, in the right-side area of the By Document page, shows the profile of similarity of the documents to the document chosen in the Files list. It allows you to view the data in one of the following modes:
· Chart tab
· Pie tab
· List tab
On the Chart tab, the histogram shows the degree of similarity of each document in the set to the document chosen in the Files list.
Depending on the state of the Ordering button on the toolbar, the chart can be ordered in one of two ways:
· By relevance,
· Alphabetically, by file name.
See also How to view the histogram.
The Pie tab shows the same results as the Chart tab in the form of a pie diagram.
To view the pie diagram, we recommend that you zoom the diagram with the Zoom Diagram button on the toolbar.
On the List tab, the list in the upper part of the screen shows in textual form the same results as the Chart tab. For each file, the degree of similarity to the base document chosen in the Files list is shown.
Double-clicking a document switches to the Document page and activates the chosen document, so that you can see its text or thematic structure.
The list in the bottom part of the screen shows the protocol of the comparison between the base document and the document chosen in the upper list. To activate this list, click a document in the upper list.
The Dictionary page allows you to investigate the program’s current dictionary. The current dictionary can be changed with the Languages radio buttons under the Options button on the toolbar.
The page consists of three areas:
· Topic tree
· Words list
· Language list
The Topic tree lists the hierarchy of topics in the dictionary. When you choose a topic (a terminal node), the Words list shows the words voting for this topic.
To jump to a desired topic, we recommend that you use the Search button on the toolbar.
This is the same tree as all the other Topic tree views in the program. Changing the selection in this tree also changes the selection on all the other pages; likewise, changes on the other pages affect the tree on this page.
In the Words list you can see the words that vote for the topic selected in the Topic tree, in the language selected in the Language list.
Each time the program encounters a word from this list in a document, it increments the relevance weight of the corresponding topic for this document.
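The voting rule just described can be sketched in Python as follows. The dictionary fragment and the function name are hypothetical, and matching words by exact lowercase form is a simplifying assumption; this manual does not describe how the program matches word forms.

```python
import re
from collections import Counter

# Hypothetical dictionary fragment: word -> topic it votes for.
WORD_TO_TOPIC = {
    "soldier": "war",
    "battle": "war",
    "horse": "animals",
    "dog": "animals",
}


def topic_weights(text, word_to_topic):
    """Count, for each topic, the occurrences of the words voting for it."""
    weights = Counter()
    for word in re.findall(r"\w+", text.lower()):
        topic = word_to_topic.get(word)
        if topic is not None:
            weights[topic] += 1  # one vote per occurrence of a dictionary word
    return weights
```

For the sentence "The soldier rode a horse into battle." this sketch counts two votes for "war" and one for "animals".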
In the Language list, you choose the language for which you want to view the words in the Words list.
The list of available languages depends on the Languages radio buttons under the Options button on the toolbar.
The About page presents the development team of the program and the symbol of the Natural Language Laboratory of CIC-IPN.
Histograms shown on the Chart tab of the Document page, the By Topic page, and the By Document page can be manipulated in the following ways:
· To change the scale, press the left mouse button and drag from the top-left to the bottom-right corner of the rectangle that should be enlarged to fill the entire chart area.
· To reset to the original scale, press the left mouse button and drag any rectangle on the chart from bottom-right to top-left.
· To move the picture, press the right mouse button and drag a point on the chart.
When you open too many files, some file names are not shown on the chart. You can zoom the chart to see the file names. You can also see the names on the List tab.
The Toolbar provides access to the following settings and tools:
· Open button
· Options button
· Font button
· Ordering button
· Statistics button
· Zoom Diagram button
· Search button
· Help button
The Open button allows you to select the files (documents) to view. Drag with the mouse or Ctrl-click to select multiple documents.
Be sure not to select any objects that are not plain text documents, such as folders, links, etc. Currently, Microsoft Word documents cannot be opened by the program.
You can only open one set of files at a time. When you open a set of files, the program closes the previously opened files.
The Options button provides access to the following settings:
· Languages radio buttons
· Hide Processing checkbox
· Show Words for Topic checkbox
· Convert Dictionaries to ANSI checkbox
· Convert Texts to ANSI checkbox
· Scaling
The Font button allows you to change the text font in some of the program’s windows.
The Ordering button allows you to order the documents or other list elements alphabetically or by relevance.
The default order is by relevance.
The Statistics button switches between two ways of calculating the importance of a topic for a document:
· By the relative weight of the topic in the document (as a percentage of the number of words in the document). With this method, the higher the concentration of words voting for a topic, the higher the relevance of the topic for the document; it does not depend on the size of the document.
· By the absolute weight of the topic. With this method, the higher the total number of words voting for a topic, the higher the relevance of the topic for the document. Typically this depends on the size of the document: the larger a document is, the more information it presents on the chosen topic, and thus the higher its relevance.
When the Statistics button is pressed, absolute weights are used for all statistics; otherwise, relative weights are used. The documents and the topics are ordered in the histograms and lists accordingly.
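The two ways of counting can be written down as a minimal Python sketch. The function name, its parameters, and the sample numbers are assumptions made for illustration; only the two formulas follow the description above.

```python
def topic_importance(votes, total_words, absolute):
    """Map each topic to its weight in one document.

    votes       -- topic -> number of word occurrences voting for the topic
    total_words -- number of words in the document
    absolute    -- True mimics the pressed Statistics button
    """
    if absolute:
        return dict(votes)                       # total number of voting words
    if total_words == 0:
        return {topic: 0.0 for topic in votes}   # empty document: no relevance
    return {topic: 100.0 * n / total_words for topic, n in votes.items()}


# Hypothetical documents: a short one and a long one on the same topic.
short_doc = topic_importance({"war": 10}, total_words=200, absolute=False)
long_doc = topic_importance({"war": 40}, total_words=4000, absolute=False)
```

Here the long document has the larger absolute weight (40 against 10), but the short document has the larger relative weight (5 percent against 1 percent), which is why the two modes can order documents differently.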
The Zoom Diagram button allows you to zoom the diagram or the dictionary, by removing the list of files or words from the screen.
We strongly recommend using this button to view the pie diagrams.
The Search button allows you to search for words in the lists or for a topic in the dictionary.
The Help button shows this guide.
The Languages radio buttons allow you to select the language of the documents and the dictionary used by the program.
· Auto – automatic detection of the language of each document.
· English – process all documents as English text.
· Spanish – process all documents as Spanish text.
· French – process all documents as French text.
· Spanish (alt.) – process all documents as Spanish text using an alternative Spanish dictionary.
We recommend that you set this option to Auto.
When the Hide Processing checkbox is unchecked and you open a new set of files with the Open button, the program opens a temporary black window for the processing of each document. This speeds up the processing but looks annoying.
When this checkbox is checked, the program processes the files in the background, which slows down the processing.
We recommend that you uncheck this checkbox for your own work, but check it when presenting the program to other people.
When the Show Words for Topic checkbox is checked, the List tab of the Results sheet on the Document page presents the words that vote for each topic.
When this checkbox is unchecked, only the names of the topics are presented.
When the Convert Dictionaries to ANSI checkbox is checked, the program assumes that the dictionaries are in OEM encoding and converts the words into ANSI (Windows) encoding when showing them on the screen.
When this checkbox is unchecked, no conversion is performed. The dictionary should then be in ANSI (Windows) encoding.
This setting does not affect the way the documents are processed, only the way the words are shown on the screen.
When the Convert Texts to ANSI checkbox is checked, the program assumes that the documents are in OEM encoding and converts the text into ANSI (Windows) encoding when showing it on the Document page.
When this checkbox is unchecked, no conversion is performed. The documents should then be in ANSI (Windows) encoding, or else some letters in the text will be shown incorrectly.
Currently you cannot choose the encoding on a per-document basis.
This setting does not affect the way the documents are processed, only the way the text is shown on the screen.
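What an OEM-to-ANSI conversion does to accented letters can be illustrated in Python. The manual does not name the code pages involved, so code page 850 (a common OEM code page for Western European languages) and code page 1252 (the Western European Windows code page) are assumptions here, as is the function name.

```python
def oem_to_ansi(raw, oem_codepage="cp850", ansi_codepage="cp1252"):
    """Re-encode bytes from an OEM (DOS) code page to an ANSI (Windows) one."""
    text = raw.decode(oem_codepage)
    # Characters missing from the target code page are replaced by "?".
    return text.encode(ansi_codepage, errors="replace")


oem_bytes = "español".encode("cp850")   # the letter ñ is byte 0xA4 in cp850
ansi_bytes = oem_to_ansi(oem_bytes)     # the same letter is byte 0xF1 in cp1252
```

If OEM bytes were displayed as ANSI without conversion, byte 0xA4 would appear as a currency sign instead of ñ, which is the kind of incorrectly shown letter mentioned above.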
The Scaling value affects the way the distance between the documents shown on the By Document page is measured. When this value is too small, all the documents are considered too close; when it is too large, all the documents are considered too far. Choose a value that best distinguishes the documents close to the given one from those far from it.
We recommend values between 1 and 100; usually 1 works well.
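The qualitative effect of the Scaling value can be imitated with a toy Python sketch. The formula below is purely an assumption chosen to reproduce the behaviour described above (a small value makes everything look close, a large value makes everything look far); this manual does not give the program's actual formula, and the raw distances are invented.

```python
import math


def scaled_similarity(raw_distance, scaling):
    """Hypothetical mapping of a raw distance to a similarity in (0, 1]."""
    return math.exp(-scaling * raw_distance)


def spread(scaling, raw_distances=(0.1, 0.5, 2.0)):
    """Gap between the most and the least similar document for a given scaling."""
    sims = [scaled_similarity(d, scaling) for d in raw_distances]
    return max(sims) - min(sims)
```

In this toy model, a moderate value separates near and far documents much better than a very small or a very large one, which is the trade-off the setting is meant to control.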
This software is (C) Copyright by the Center for Computing Research of the National Polytechnic Institute, Mexico.
It was developed by the Natural Language Laboratory.
The program is partially based on the ideas of the Clasitex technology [1, 2, 3] developed by Dr. Adolfo Guzmán Arenas.
The Classifier Demo development team:
Design: Dr. Alexander Gelbukh,
Programming: Dr. Grigori Sidorov,
Data: Beatriz Beltrán, Sofía Galicia Haro.
1. Adolfo Guzmán-Arenas. Finding the main themes in a Spanish document. Journal Expert Systems with Applications, Vol. 14, No. 1/2. Jan/Feb 1998, pp. 139-148.
2. Adolfo Guzmán Arenas. Hallando los temas principales en un artículo en español. Soluciones Avanzadas. Vol. 5, No. 45, p. 58, No. 49, p. 66, 1997.
3. Adolfo Guzmán Arenas. Hallando los temas principales en un artículo en español. Proc. Simposium Internacional de Computación, IPN, 1997, Mexico.
4. Beatriz Beltrán Martínez, Adolfo Guzmán Arenas, Francisco Martínez Trinidad, José Ruiz Shulcloper. Clasitex++: una herramienta para el análisis de textos. Memorias del Tercer Taller Iberoamericano de Reconocimiento de Patrones, TIARP-98. CIC, IPN, marzo 1998, pp. 369-379.
5. Alexander Gelbukh, Grigori Sidorov, Adolfo Guzmán-Arenas. Text categorization using a hierarchical topic dictionary. Proc. Text Mining workshop at 16th International Joint Conference on Artificial Intelligence (IJCAI'99), Stockholm, Sweden, July 31 – August 6, 1999, pp. 34-35. http://www.dsv.su.se/ijcai-99
6. Mikhail Alexandrov, Alexander Gelbukh. Measures for determining thematic structure of documents with Domain Dictionaries. Proc. Text Mining workshop at 16th International Joint Conference on Artificial Intelligence (IJCAI'99), Stockholm, Sweden, July 31 – August 6, 1999, pp. 10-12. http://www.dsv.su.se/ijcai-99
7. Alexander Gelbukh, Grigori Sidorov, A. Guzmán-Arenas. Document classification with a weighted topic hierarchy. Proc. 1st International Workshop on Document Analysis and Understanding for Document Databases (DAUDD’99), 10th International Conference and Workshop on Database and Expert Systems Applications (DEXA), Florence, Italy, September 1, 1999. IEEE Computer Society Press, pp. 566-570. http://mcculloch.ing.unifi.it/~docproc/DAUDD99/daudd_program.html
8. A. Gelbukh, G. Sidorov, A. Guzman-Arenas. Use of a weighted topic hierarchy for text retrieval and classification. In Václav Matoušek et al. (Eds.). Text, Speech and Dialogue. Proc. 2nd International Workshop TSD-99, Plzen, Czech Republic, September 13-17, 1999. Lecture Notes in Artificial Intelligence, No. 1692, Springer, pp. 130–135. http://www-kiv.zcu.cz/events/tsd99/abstract.html
9. Alexander Gelbukh, Grigori Sidorov, Adolfo Guzman-Arenas. A system for search and classification of the documents with the use of a hierarchic thematic dictionary (in Russian). Accepted to Proc. 8th International Conference Knowledge-Dialogue-Solution (KDS–99), Yalta, Ukraine, September 13-18, 1999.
10. A. Gelbukh, G. Sidorov, and A. Guzmán-Arenas. A Method of Describing Document Contents through Topic Selection. Proc. SPIRE’99, International Symposium on String Processing and Information Retrieval, Cancun, Mexico, September 22 – 24. IEEE Computer Society Press, 1999, pp. 73-80. http://garota.fismat.umich.mx/spire99