National Polytechnic Institute
Center for Computing Research
Natural Language Processing Laboratory

 

 

 

 

 

Demo 3.1

 

User Manual

 

Alexander Gelbukh

Grigori Sidorov

 

 

 

 

 

Mexico City

1999


Contents

1.      Welcome to Classifier

1.1        The main screen

1.2        How Do I...

2.      Screen Elements

2.1        Document page

2.1.1     Files list (Document page)

2.1.2     Topic tree (Document page)

2.1.3     Results sheet (Document page)

2.2        By Topic page

2.2.1     Topic tree (By Topic page)

2.2.2     Results sheet (By Topic page)

2.3        By Document page

2.3.1     Files list (By Document page)

2.3.2     Topic tree (By Document page)

2.3.3     Results sheet (By Document page)

2.4        Dictionary page

2.4.1     Topic tree (Dictionary page)

2.4.2     Words list (Dictionary page)

2.4.3     Language list (Dictionary page)

2.5        About page

2.6        How to view the histogram

3.      Toolbar

3.1        Open button

3.2        Options button

3.3        Font button

3.4        Ordering button

3.5        Statistics button

3.6        Zoom Diagram button

3.7        Search button

3.8        Help button

4.      Options

4.1        Languages

4.2        Hide Processing checkbox

4.3        Show Words for Topic checkbox

4.4        Convert Dictionaries to ANSI checkbox

4.5        Convert Texts to ANSI checkbox

4.6        Scaling

5.      Development team

6.      Bibliography on Classifier

 

1.     Welcome to Classifier

The Classifier Demo program allows you to investigate individual documents and sets of documents:

·                     To determine the topics mentioned in the document,

·                     To determine which documents are similar to the chosen one with respect to the topics that they mention,

·                     To choose the thematic aspect on the basis of which the documents are compared,

·                     To get explanations of how the program calculates the results, and

·                     To investigate the program’s dictionary.

Please see the following topics:

            How Do I... (section 1.2),

            Screen Elements (section 2),

            Development team (section 5).

1.1     The main screen

This is a sample screen of the program:

The screen shows a chart of the principal topics in a Spanish document.

1.2     How Do I...

·                     To open a set of documents to classify, use the Open button.

·                     To view the thematic structure of an open document, switch to the Document page and select the document in the Files list. Then switch to the desired tab on the Results sheet.

·                     To determine which documents are most relevant to a specific topic, switch to the By Topic page and choose the topic in the Topic tree, then switch to the desired tab on the Results sheet. Hint: to find the desired topic, switch to the Dictionary page and use the Search button, then switch back to the By Topic page.

·                     To determine which documents are closest in their thematic structure to a given one, switch to the By Document page, choose the document in the Files list, then switch to the desired tab on the Results sheet. You can also choose the thematic aspect in the Topic tree.

2.     Screen Elements

On the Classifier Demo screen, you can choose one of the following five pages by clicking the tabs at the top of the screen:

            Document page (section 2.1).

            By Topic page (section 2.2).

            By Document page (section 2.3).

            Dictionary page (section 2.4).

            About page (section 2.5).

Also on the Classifier Demo screen you will find the following elements:

            Toolbar (section 3).

            Options (section 4).

2.1     Document page

The Document page allows you to investigate an individual document. It is divided into the following three areas:

            Files list.

            Topic tree.

            Results sheet.

2.1.1     Files list (Document page)

This list allows you to choose a document (by its file name) that you want to investigate. All the information on the Results sheet in the right part of the screen is presented with respect to the chosen document.

The list shows the names of the files currently loaded along with the automatically detected language of each document. If the language is Default, the document is assumed to be in the language chosen with the Options button on the toolbar.

Double-clicking an element of the list switches to the By Document page and chooses the document as the base for comparison.

2.1.2     Topic tree (Document page)

This tree allows you to choose a topic or a branch in the hierarchy of topics with respect to which the document will be investigated. Only the chosen topics are considered when determining the thematic profile of the document shown on the Results sheet.

2.1.3     Results sheet (Document page)

The right-side area of the Document page shows the thematic profile of the document. It allows you to view the data in one of the following modes:

            Text tab.

            Chart tab.

            Pie tab.

            List tab.

2.1.3.1     Text tab (Document page)

This text box shows the full text of the document.

To find a word in the text, use the Search button on the toolbar.

2.1.3.2     Chart tab (Document page)

The histogram shows the most important topics mentioned in the document, arranged in order of importance. The more times a topic is mentioned in the document, the more important it is for this document.

There are two ways of calculating the importance of a topic; you switch between them with the Statistics button on the toolbar.

See also How to view the histogram (section 2.6).

2.1.3.3     Pie tab (Document page)

The pie diagram shows the same results as the Chart tab, in pie form.

To view the pie diagram, we recommend enlarging it with the Zoom Diagram button on the toolbar.

2.1.3.4     List tab (Document page)

The list shows the topics found in the document and, under each topic, the words voting for this topic. The words are in the language of the document, while the names of topics are in the language of the dictionary (normally English).

Next to each word, the number of occurrences or the relative weight of this word in the document is shown; next to each topic, the accumulated weight of the words voting for this topic or the total number of their occurrences is shown.

The order of topics depends on the Ordering button on the toolbar. Whether the words for each topic are shown depends on the Show Words for Topic checkbox under the Options button.

To jump to a desired topic or word, we recommend using the Search button.

2.2     By Topic page

The By Topic page allows you to investigate the set of documents by topic. It consists of the following two areas:

            Topic tree.

            Results sheet.

2.2.1     Topic tree (By Topic page)

The tree shows the topic hierarchy. When you choose a topic or a branch in the hierarchy, the Results sheet in the right part of the screen shows the statistics by the selected topic(s).

This is the same tree as all the other Topic tree views in the program. When you change the selection in this tree, you also change the selection on all the other pages; likewise, changes on the other pages affect the tree on this page.

To find a particular topic, we recommend switching to the Dictionary page, using the Search button there, and then switching back to the By Topic page.

2.2.2     Results sheet (By Topic page)

The right-side area of the By Topic page shows the weights of the topic selected in the Topic tree for all the documents. It allows you to view the data in one of the following modes:

            Chart tab.

            Pie tab.

            List tab.

2.2.2.1     Chart tab (By Topic page)

The histogram shows the weight of the selected topic in each document, arranged in order of relevance. The more times the topic is mentioned in a document, the more relevant this document is to the topic.

Depending on the state of the Ordering button on the toolbar, the chart can be ordered in one of two ways:

·                     By relevance,

·                     Alphabetically, according to the file names.

If the chart is ordered by relevance, the order of the documents depends on how their weights are counted:

·                     With absolute weights, the larger documents usually go before the smaller ones, since they give more information (in total words) on the selected topic.

·                     With relative weights, the documents with a higher concentration (in percent) of the words voting for the selected topic go first.

You switch between these two modes with the Statistics button on the toolbar.

See also How to view the histogram (section 2.6).

2.2.2.2     Pie tab (By Topic page)

The pie diagram shows the same results as the Chart tab, in pie form.

To view the pie diagram, we recommend enlarging it with the Zoom Diagram button on the toolbar.

2.2.2.3     List tab (By Topic page)

The list shows in textual form the same results as the Chart tab. For each file, the absolute and relative weights of the chosen topic in the text of the document are shown.

The order of elements depends on the state of the Statistics button on the toolbar; see the explanations for the Chart tab.

Double-clicking a document switches to the Document page and activates the chosen document, so that you can see its text or thematic structure.

2.3     By Document page

The By Document page allows you to search for documents similar in thematic structure to a chosen one; the aspect of similarity depends on the topic selected in the Topic tree. The page consists of the following three areas:

            Files list.

            Topic tree.

            Results sheet.

2.3.1     Files list (By Document page)

This list allows you to choose a document with which all the other documents will be compared. The results of the comparison are shown on the Results sheet in the right part of the screen.

2.3.2     Topic tree (By Document page)

This tree allows you to choose the thematic aspect of the comparison of the documents. You can choose a topic of interest or a branch in the topic hierarchy. To calculate the measure of similarity between two documents, only the words relevant to the chosen topic(s) are taken into account.

The aspect is important for the comparison of documents. For example, let the base document be devoted

·                     (0) to the use of animals in war,

and let the other two documents be devoted

·                     (1) to the use of animals in the circus, and

·                     (2) to the use of electronic devices in war.

Then, from the point of view of biology, document (1) is similar to the base document while (2) is not; on the other hand, from the point of view of military science, document (2) is similar to the base document while (1) is not.
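The effect of the aspect can be illustrated with a small sketch. It is not the program's actual similarity measure, which this manual does not specify: the topic names, the weights, the two branches, and the use of cosine similarity are all assumptions made for illustration only.

```python
from math import sqrt

# Hypothetical topic weights for the three documents of the example.
base = {"animals": 5, "war": 4}          # (0) animals in war
doc1 = {"animals": 6, "circus": 3}       # (1) animals in the circus
doc2 = {"electronics": 5, "war": 4}      # (2) electronic devices in war

# Hypothetical branches of the topic hierarchy used as aspects.
BIOLOGY = {"animals"}
MILITARY = {"war"}

def restrict(weights, aspect):
    """Keep only the topics that belong to the chosen aspect."""
    return {t: w for t, w in weights.items() if t in aspect}

def cosine(a, b):
    """Cosine similarity of two sparse weight vectors (assumed measure)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def similarity(a, b, aspect):
    """Compare two documents using only the topics of the chosen aspect."""
    return cosine(restrict(a, aspect), restrict(b, aspect))

print(similarity(base, doc1, BIOLOGY), similarity(base, doc2, BIOLOGY))
print(similarity(base, doc1, MILITARY), similarity(base, doc2, MILITARY))
```

With these made-up weights, document (1) is close to the base document under the biology aspect and document (2) under the military aspect, as in the example.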

This is the same tree as all the other Topic tree views in the program. When you change the selection in this tree, you also change the selection on all the other pages; likewise, changes on the other pages affect the tree on this page.

To find a particular topic, we recommend switching to the Dictionary page, using the Search button there, and then switching back to the By Document page.

2.3.3     Results sheet (By Document page)

The right-side area of the By Document page shows the profile of similarity of the documents to the document chosen in the Files list. It allows you to view the data in one of the following modes:

            Chart tab.

            Pie tab.

            List tab.

2.3.3.1     Chart tab (By Document page)

The histogram shows the degree of similarity of the documents in the set to the document chosen in the Files list.

Depending on the state of the Ordering button on the toolbar, the chart can be ordered in one of two ways:

·                     By relevance,

·                     Alphabetically, according to the file names.

See also How to view the histogram (section 2.6).

2.3.3.2     Pie tab (By Document page)

The pie diagram shows the same results as the Chart tab, in pie form.

To view the pie diagram, we recommend enlarging it with the Zoom Diagram button on the toolbar.

2.3.3.3     List tab (By Document page)

The list in the upper part of the screen shows in textual form the same results as the Chart tab. For each file, the degree of similarity to the base document chosen in the Files list is shown.

Double-clicking a document switches to the Document page and activates the chosen document, so that you can see its text or thematic structure.

The list in the bottom part of the screen shows the protocol of the comparison between the base document and the document chosen in the upper list. To activate this list, click a document in the upper list.

2.4     Dictionary page

The Dictionary page allows you to investigate the program’s current dictionary. The current dictionary can be changed with the Languages radio buttons under the Options button on the toolbar.

The page consists of three areas:

            Topic tree.

            Words list.

            Language list.

2.4.1     Topic tree (Dictionary page)

The Topic tree shows the hierarchy of topics in the dictionary. When you choose a topic (a terminal node), the Words list shows the words voting for this topic.

To jump to a desired topic, we recommend using the Search button on the toolbar.

This is the same tree as all the other Topic tree views in the program. When you change the selection in this tree, you also change the selection on all the other pages; likewise, changes on the other pages affect the tree on this page.

2.4.2     Words list (Dictionary page)

In the Words list you can see the words that vote for the topic selected in the Topic tree, in the language selected in the Language list.

Each time the program encounters a word from this list in a document, it increments the relevance weight of the corresponding topic for this document.
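The voting described above can be sketched as follows. This is a minimal illustration, not the program's code: the dictionary contents, the function name topic_weights, and the simple whitespace tokenization are assumptions, and the real program may also handle word forms and the topic hierarchy.

```python
from collections import Counter

# Hypothetical dictionary fragment: each word votes for one topic.
DICTIONARY = {
    "soldier": "war",
    "tank": "war",
    "horse": "animals",
    "dog": "animals",
}

def topic_weights(text):
    """Count, for each topic, the occurrences of the words voting for it."""
    weights = Counter()
    for token in text.lower().split():
        word = token.strip(".,;:!?")      # drop surrounding punctuation
        topic = DICTIONARY.get(word)
        if topic is not None:
            weights[topic] += 1           # one vote per occurrence
    return weights

weights = topic_weights("The soldier rode a horse. The horse saw a tank.")
print(dict(weights))
```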

2.4.3     Language list (Dictionary page)

In the Language list, you choose the language for which you want to view the words in the Words list.

The list of available languages depends on the Languages radio buttons under the Options button on the toolbar.

2.5     About page

The About page presents the development team of the program and the symbol of the Natural Language Laboratory of CIC-IPN.

2.6     How to view the histogram

The histograms shown on the Chart tab of the Document page, the Chart tab of the By Topic page, and the Chart tab of the By Document page can be manipulated in the following ways:

·                     To change the scale, press the left mouse button and drag from the top-left to the bottom-right corner of the rectangle that you want to enlarge to the entire area of the chart.

·                     To reset to the original scale, press the left mouse button and drag any rectangle on the chart from bottom-right to top-left.

·                     To move the picture, press the right mouse button and drag a point on the chart.

When you open too many files, some file names are not shown on the chart. You can enlarge the chart to see the file names. You can also see the names on the List tab.

3.     Toolbar


The Toolbar provides access to the following settings and tools:

            Open button.

            Options button.

            Font button.

            Ordering button.

            Statistics button.

            Zoom Diagram button.

            Search button.

            Help button.

3.1     Open button

 

The Open button allows you to select the files (documents) to view. Use mouse drag and Ctrl-clicks to select multiple documents.

Be sure not to select any objects that are not plain text documents, such as folders, links, etc. Currently the program cannot open Microsoft Word documents.

You can only open one set of files at a time. When you open a set of files, the program closes the previously opened files.

3.2     Options button

 

The Options button provides access to the following settings:

            Languages radio buttons.

            Hide Processing checkbox.

            Show Words for Topic checkbox.

            Convert Dictionaries to ANSI checkbox.

            Convert Texts to ANSI checkbox.

            Scaling.

3.3     Font button

 

The Font button allows you to change the text font in some of the program’s windows.

3.4     Ordering button

 

The Ordering button allows you to order the documents or other list elements alphabetically or by relevance.

The default order is by relevance.

3.5     Statistics button

 

There are two ways of calculating the importance of a topic for a document:

·                     By relative weight of the topic in the document (as a percentage of the number of words in the document). With this method, the higher the concentration of the words voting for a topic, the higher the relevance of the topic for the document; it does not depend on the size of the document.

·                     By absolute weight of the topic. With this method, the higher the total number of the words voting for a topic, the higher the relevance of the topic for the document. Typically this depends on the size of the document: the larger a document is, the more information it presents on the chosen topic and thus the higher its relevance.

When the Statistics button is pressed, the absolute weights are used for all statistics; otherwise the relative weights are used. The documents and the topics are ordered in the histograms and lists accordingly.
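The difference between the two methods can be shown with a small sketch. The file names and word counts below are invented for illustration, and the function name relative_weight is hypothetical; only the two definitions (total votes versus percentage of the words in the document) come from the description above.

```python
def relative_weight(votes, total_words):
    """Votes for the topic as a percentage of all words in the document."""
    return 100.0 * votes / total_words if total_words else 0.0

# Hypothetical documents: name -> (votes for the chosen topic, total words).
docs = {
    "long.txt": (30, 1000),   # many votes, low concentration (3%)
    "short.txt": (10, 100),   # fewer votes, high concentration (10%)
}

# Absolute weights: the document with more votes goes first.
by_absolute = sorted(docs, key=lambda name: docs[name][0], reverse=True)

# Relative weights: the document with the higher concentration goes first.
by_relative = sorted(docs, key=lambda name: relative_weight(*docs[name]), reverse=True)

print(by_absolute)
print(by_relative)
```

With these numbers the larger document leads under absolute weights, while the smaller but more concentrated document leads under relative weights.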

3.6     Zoom Diagram button

 

The Zoom Diagram button allows you to enlarge the diagram or the dictionary by removing the list of files or words from the screen.

We strongly recommend using this button to view the pie diagrams.

3.7     Search button

 

The Search button allows you to search for a word in the lists or for a topic in the dictionary.

3.8     Help button

 

The Help button shows this guide.

4.     Options

4.1     Languages

The Languages radio buttons allow you to select the language of the documents and the dictionary used by the program.

 

·      Auto

– automatic detection of the language of each document.

·      English

– process all documents as English text.

·      Spanish

– process all documents as Spanish text.

·      French

– process all documents as French text.

·      Spanish (alt.)

– process all documents as Spanish text using an alternative Spanish dictionary.

 

We recommend setting this option to Auto.

4.2     Hide Processing checkbox

When the Hide Processing checkbox is unchecked and you open a new set of files with the Open button, the program opens a temporary black window while processing each document. This speeds up the processing but looks annoying.

When this checkbox is checked, the program processes the files in the background, which slows down the processing.

We recommend unchecking this checkbox for your own work, but checking it when presenting the program to other people.

4.3     Show Words for Topic checkbox

When the Show Words for Topic checkbox is checked, the List tab on the Results sheet of the Document page presents the words that vote for each topic.

When this checkbox is unchecked, only the names of topics are presented.

4.4     Convert Dictionaries to ANSI checkbox

When the Convert Dictionaries to ANSI checkbox is checked, the program assumes that the dictionaries are in OEM encoding and converts the words into ANSI (Windows) encoding when showing them on the screen.

When this checkbox is unchecked, no conversion is performed. The dictionary should then be in ANSI (Windows) encoding.

This setting does not affect the way the documents are processed, only the way the words are shown on the screen.
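An OEM-to-ANSI conversion of this kind can be sketched as follows. The manual does not say which code pages the program uses; code page 850 for OEM and code page 1252 for ANSI are assumptions typical of Western European Windows systems, and the function name oem_to_ansi is hypothetical.

```python
OEM_CODEPAGE = "cp850"     # assumed DOS (OEM) code page
ANSI_CODEPAGE = "cp1252"   # assumed Windows (ANSI) code page

def oem_to_ansi(data: bytes) -> bytes:
    """Re-encode OEM bytes as ANSI bytes for display."""
    text = data.decode(OEM_CODEPAGE)
    # Characters with no ANSI equivalent (e.g. box-drawing symbols) become "?".
    return text.encode(ANSI_CODEPAGE, errors="replace")

# In code page 850 the letter "ñ" is byte 0xA4; in code page 1252 it is 0xF1.
converted = oem_to_ansi(b"espa\xa4ol")
print(converted.decode(ANSI_CODEPAGE))
```

Without the conversion, a byte such as 0xA4 would be displayed as a different character under the Windows code page, which is why unconverted OEM text shows some letters incorrectly.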

4.5     Convert Texts to ANSI checkbox

When the Convert Texts to ANSI checkbox is checked, the program assumes that the documents are in OEM encoding and converts the text into ANSI (Windows) encoding when showing it on the Document page.

When this checkbox is unchecked, no conversion is performed. The documents should then be in ANSI (Windows) encoding, or else some letters in the text will be shown incorrectly.

Currently you cannot choose the encoding on a per-document basis.

This setting does not affect the way the documents are processed, only the way the text is shown on the screen.

4.6     Scaling

The Scaling value affects the way the distance between documents shown on the By Document page is measured. When this setting is too low, all the documents are considered too close. When it is too high, all the documents are considered too far apart. Choose a setting that best distinguishes the documents close to the given one from those far from it.

We recommend using a setting between 1 and 100; usually 1 works well.

5.     Development team     

 

This software is (C) Copyright by
the Center for Computing Research of National Polytechnic Institute, Mexico.
It was developed by the Natural Language Laboratory.

 

        

 

The program is partially based on the ideas of
the Clasitex technology [1, 2, 3] developed by
Dr. Adolfo Guzmán Arenas.

 

The Classifier Demo development team:

Design: Dr. Alexander Gelbukh,
Programming: Dr. Grigori Sidorov,
Data: Beatriz Beltrán, Sofía Galicia Haro.

6.     Bibliography on Classifier

1.    Adolfo Guzmán-Arenas. Finding the main themes in a Spanish document. Journal Expert Systems with Applications, Vol. 14, No. 1/2. Jan/Feb 1998, pp. 139-148.

2.    Adolfo Guzmán Arenas. Hallando los temas principales en un artículo en español. Soluciones Avanzadas. Vol. 5, No. 45, p. 58, No. 49, p. 66, 1997.

3.    Adolfo Guzmán Arenas. Hallando los temas principales en un artículo en español. Proc. Simposium Internacional de Computación, IPN, 1997, Mexico.

4.    Beatriz Beltrán Martínez, Adolfo Guzmán Arenas, Francisco Martínez Trinidad, José Ruiz Shulcloper. Clasitex++: una herramienta para el análisis de textos. Memorias del Tercer Taller Iberoamericano de Reconocimiento de Patrones, TIARP-98. CIC, IPN, marzo 1998, pp. 369-379.

5.    Alexander Gelbukh, Grigori Sidorov, Adolfo Guzmán-Arenas. Text categorization using a hierarchical topic dictionary. Proc. Text Mining workshop at 16th International Joint Conference on Artificial Intelligence (IJCAI'99), Stockholm, Sweden, July 31 – August 6, 1999, pp. 34-35. http://www.dsv.su.se/ijcai-99

6.    Mikhail Alexandrov, Alexander Gelbukh. Measures for determining thematic structure of documents with Domain Dictionaries. Proc. Text Mining workshop at 16th International Joint Conference on Artificial Intelligence (IJCAI'99), Stockholm, Sweden, July 31 – August 6, 1999, pp. 10-12. http://www.dsv.su.se/ijcai-99

7.    Alexander Gelbukh, Grigori Sidorov, A. Guzmán-Arenas. Document classification with a weighted topic hierarchy. Proc. 1st International Workshop on Document Analysis and Understanding for Document Databases (DAUDD’99), 10th International Conference and Workshop on Database and Expert Systems Applications (DEXA), Florence, Italy, September 1, 1999. IEEE Computer Society Press, pp. 566-570. http://mcculloch.ing.unifi.it/~docproc/DAUDD99/daudd_program.html

8.    A. Gelbukh, G. Sidorov, A. Guzman-Arenas. Use of a weighted topic hierarchy for text retrieval and classification. In Václav Matoušek et al. (Eds.). Text, Speech and Dialogue. Proc. 2nd International Workshop TSD-99, Plzen, Czech Republic, September 13-17, 1999. Lecture Notes in Artificial Intelligence, No. 1692, Springer, pp. 130–135. http://www-kiv.zcu.cz/events/tsd99/abstract.html

9.    Alexander Gelbukh, Grigori Sidorov, Adolfo Guzman-Arenas. A system for search and classification of the documents with the use of a hierarchic thematic dictionary (in Russian). Accepted to Proc. 8th International Conference Knowledge-Dialogue-Solution (KDS–99), Yalta, Ukraine, September 13-18, 1999.

10.  A. Gelbukh, G. Sidorov, and A. Guzmán-Arenas. A Method of Describing Document Contents through Topic Selection. Proc. SPIRE’99, International Symposium on String Processing and Information Retrieval, Cancun, Mexico, September 22-24. IEEE Computer Society Press, 1999, pp. 73-80. http://garota.fismat.umich.mx/spire99