I.A. Bolshakov and A.F. Gelbukh. Classification of Collocations in a Lexical Database by Meaning of the Combined Words. In A. Guzman, R. Menchaca (eds.). Selected papers CIC-1999, CIC, IPN, Mexico City, 2000, ISBN 970-18-4250-2, pp. 5-15.

A revised version of: Rubrification of Word Combinations in the Databases by Elements of Meaning of Combined Words (in Russian). J. Nauchno-Tehnicheskaya Informaciya (NTI), ser. 2, vol. 2, Moscow, Russia, 2000.

Classification of Collocations in Databases
by Meaning of Combined Words*

I. A. Bolshakov & A. F. Gelbukh

ABSTRACT

A method is developed for classifying the sets of attributive collocations (word combinations) in large collocation databases and for supplying them with rubrics. The method is equally applicable to different natural languages, but the basic corpus of collocations and the examples are taken from Russian.

INTRODUCTION

In recent years, databases of word combinations (collocations) of natural languages have attracted more and more attention. Such databases and published dictionaries have been created at least for English [1], Italian [2], and Russian [3]. These databases have two important applications:

·        as a reference guide for authors preparing texts on a computer, and

·        as a tool for filtering variants of analysis and synthesis in systems for automatic processing of natural-language texts.

This understanding of how to use collocations is already established, and the term word attraction has appeared to characterize the phenomenon under investigation.

However, so far we know of only one experimental system for text preparation with an embedded database of collocations. This system, named CrossLexica, was created by the authors of this article [3, 4]. Its database is used as the basic corpus for the purposes of this work.

CrossLexica is divided into subsystems reflecting various classes of syntactic relations between words. The subsystem called Has_Attributes, after receiving a query in the form of a keyword of any part of speech, outputs the collocations in which the keyword is supplied with a syntactically subordinated entity in the form of another word or a short collocation. When the keyword is a noun, the subordinated words are adjectives or adjectival constructions. For example, for the English keyword man, the attributes include acute, aggressive, all-round, average, bad, big, blind, clever, colored, dead, honest, of ability, of action, of affairs, of many accomplishments... For an adjectival, verbal, or adverbial keyword, the related words are adverbs or adverbial constructions. In this article, we are interested in noun keywords, which constitute the majority in the subsystem under investigation.

Over recent years, the number of nouns characterized in our database by attributive collocations has been increasing, as has the mean number of collocations related to each keyword. At present this mean slightly exceeds 11.

Let us call the quantitative measure of a noun’s ability to form collocations of a specific type its fertility. The fertility of various nouns is quite different. The unevenness of many linguistic distributions is well known, so it was to be expected in this case as well. The most fertile nouns are characterized by hundreds of different attributes, whereas the majority of the rest have only a few. To be more specific, each noun among the hundred most fertile forms no fewer than 95 collocations, among the first two hundred, no fewer than 72, and among the first three hundred, no fewer than 60. The list of attributive combinations for a rather fertile noun does not fit on the screen. Searching such lists for the necessary attributes by their meaning takes more and more time. Hence the need arose to facilitate the search by automatic means.
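The notion of fertility and the rank thresholds mentioned above can be illustrated with a minimal Python sketch. The toy dictionary is a hypothetical stand-in for the Has_Attributes subsystem, filled with a handful of attributes quoted in this article; the function names are illustrative assumptions and are not part of CrossLexica.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical miniature of the attributive database: keyword -> attributes.
# The real lists are far longer; these few items are quoted from this article.
ATTRIBUTES: Dict[str, Set[str]] = {
    "человек": {"бедный", "богатый", "высокий", "добрый", "умный"},
    "покрытие": {"алмазное", "дорожное", "защитное", "гладкое"},
    "контроль": {"авторский", "валютный", "ручной"},
    "цены": {"высокие", "низкие"},
}


def fertility(keyword: str, db: Dict[str, Set[str]]) -> int:
    """Fertility of a noun: the number of its attributive collocations."""
    return len(db.get(keyword, set()))


def rank_by_fertility(db: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Keywords sorted from the most to the least fertile (ties alphabetical)."""
    return sorted(((k, len(v)) for k, v in db.items()),
                  key=lambda item: (-item[1], item[0]))


def mean_fertility(db: Dict[str, Set[str]]) -> float:
    """Mean number of attributive collocations per keyword."""
    return sum(len(v) for v in db.values()) / len(db)


def threshold_of_top(db: Dict[str, Set[str]], n: int) -> int:
    """Smallest fertility found among the n most fertile keywords."""
    return rank_by_fertility(db)[:n][-1][1]
```

On the real database, `threshold_of_top(db, 100)` would correspond to the figure of 95 collocations quoted above, provided the data were loaded in this keyword-to-attributes form.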

In this article, we propose to divide the lists of attributive collocations into rubrics related to elements of the semantic interpretation of the attributive words. The rubrics are selected independently for each keyword. They are needed first of all for the first several hundred of the most fertile nouns, but in principle are useful for all the rest as well (at present, the system contains 12,200 nouns with attributive combinations).

The rubrics appearing within these lists are similar to those in ideographic dictionaries, or thesauri [5]. However, the thematic rubrics in thesauri may take the form of scientific terms and constructions not common in everyday speech. For our purposes, we try to select common words or collocations as rubric titles. At each step, we tried to minimize the total number of rubrics constituting the hierarchies and to keep the rubrics consistent and intelligible for an educated user without any linguistic background.

The exposition below follows the inductive principle. We take several of the most fertile nouns and try to find out what specific rubrics are needed for their attributive collocations. Nouns rather remote from each other in their semantics are studied in order to find the minimal number of types of titles and subtitles needed to construct a hierarchical rubric system. Our work does not claim to be a deep semantic study of the words taken. It is more important for us that the corpus of collocations itself suggests specific classification decisions, and we demonstrate them.

Based on our experience in rubric construction, some preliminary conclusions are given, with numerous reservations. Specifically, the proposed conception of classification is found to be realizable and useful, but very laborious and subject to essential limitations that seem to be irremovable in principle. It is also shown that the rubrics selected for attributes are a good tool for the so-called portraying of the nouns under investigation. Portraying is understood as the linguistic study of the situations typical for these nouns in their common use, as well as of the typical roles relevant for these situations.

Turning to specific highly fertile nouns, we should mention that their singular and plural forms are considered separately in our database. The reason is that the different numbers of the same noun can have different sets of attributes [6], so that their fertilities can differ as well.

KEYWORD человек ‘man, human’

The keyword человек, with fertility rank 1 in our database (more than 800 attributive collocations), semantically represents an argument of a multitude of various predicates, usually of the evaluative type. To classify all these predicates faultlessly is scarcely possible. Hence, an approximation is proposed below, with some disadvantages that remain noticeable in spite of all our efforts to eliminate them.

At the highest level of rubrics, all attributes were divided into social, behavior, moral, intellectual, and outward (physical) features of man. Each of these groups is divided into more specific rubrics, which intersect somewhat. The results with corresponding Russian examples are given below. Each rubric includes both positive and negative estimates of the corresponding feature.

Social features:

·        Importance (social significance, estate): бесполезный, большой, великий, влиятельный, дорогой, замечательный, крупный, лишний, любимый, маленький, незаметный, никудышный, ничтожный, нужный, полезный, простой, пустой, родной, с большой буквы, средний, уважаемый,...

·        Notoriety: близкий, знакомый, знаменитый, известный, незнакомый, новый, свой, таинственный,...

·        Exceptionality: выдающийся, замечательный, интересный, исключительный, любопытный, неинтересный, необыкновенный, необычный, нормальный, обычный, особенный, своеобразный, средний,...

·        Prosperity: бедный, богатый, зажиточный, из среднего класса, небогатый, нищий, обеспеченный, сверхбогатый, состоятельный,...

·         Family status: вдовый, одинокий, разведенный, семейный, холостой,...

·        Social class: военный, городской, гражданский, рабочий, сельский,...

Behavior features:

·        Sociability: болтливый, занудный, приветливый, разговорчивый, нахальный, откровенный, открытый, шумный, замкнутый, застенчивый, молчаливый, робкий, сдержанный,...

·        Breeding: бестактный, вежливый, внимательный, грубый, дерзкий, дикий, культурный, любезный, невежливый, невоспитанный, некультурный, тактичный,...

·         Enterprisingness: активный, безынициативный, изобретательный, инициативный, пассивный, предприимчивый, творческий,...

·        Practicalness: деловой, беспомощный, бывалый, опытный, практический, практичный, расчетливый, трезвый, хозяйственный, экономный, непрактичный, расточительный,...

·        Temper: активный, беспокойный, бесстрастный, бесчувственный, бодрый, бойкий, влюбчивый, восторженный, впечатлительный, вспыльчивый, выдержанный, горячий, деятельный, живой, задумчивый, инертный, истеричный, капризный, медлительный,...

·        Character: безвольный, бесстрашный, властный, властолюбивый, волевой, высокомерный, гордый, деспотичный, демократичный, доверчивый, железный, заносчивый, ленивый, легкий, легкоранимый, мечтательный,...

Moral features:

·         Conscientiousness: аккуратный, беззаботный, безответственный, беспечный, добросовестный, исполнительный, неаккуратный, небрежный, недобросовестный, необязательный, несерьезный, обязательный, сознательный,...

·        Kindness: агрессивный, безжалостный, бескорыстный, беспощадный, бессердечный, гостеприимный, добродушный, доброжелательный, добрый, дружелюбный, жадный, жесткий, жестокий, заботливый, золотой, миролюбивый,...

·         Moral level: безнравственный, безыдейный, беспринципный, бесстыдный, благородный, великодушный, грязный, идейный, искренний, испорченный, коварный, лживый, лукавый, мелкий, мелочный, мстительный, непорядочный,...

Intellectual features:

·         Education: грамотный, интеллигентный, компетентный, невежественный, неграмотный, отсталый, передовой, прогрессивный, развитой, темный,...

·        Endowments: гениальный, даровитый, одаренный, способный, бездарный, бесталанный,...

·        Luck: несчастный, неудачливый, счастливый, удачливый,...

·        Intellect in general: беспристрастный, благоразумный, вдумчивый, памятливый, здравомыслящий, мудрый, любознательный, наблюдательный, неглупый, остроумный, пошлый, разумный, сообразительный, толковый, умный, хитрый,...

·        Skills: ловкий, неловкий, неумелый, умелый, сноровистый,...

Outward features:

·        Age: в годах, в цветущем возрасте, взрослый, молодой, немолодой, пожилой, среднего возраста, старый,...

·        Hair: бородатый, волосатый, кудрявый, лысый, плешивый, русоволосый, светловолосый, седой, темноволосый, усатый, черноволосый,...

·        Eyes: глазастый, голубоглазый, кареглазый, сероглазый, синеглазый, темноглазый, черноглазый,...

·        Health: болезненный, больной, здоровый, цветущего вида, хилый,...

·        Skin: бледный, загорелый, краснолицый, розовощекий, румяный, смуглый,...

·        Mood: веселый, довольный, радостный, грустный, печальный, недовольный, разгневанный, расстроенный, сумрачный, мрачный, невеселый, угрюмый, хмурый,...

·         Clothing: бедно одетый, голый, легко одетый, плохо одетый, разодетый, хорошо одетый,...

·        Outward attractiveness: интересный, красивый, милый, некрасивый, неинтересный, обаятельный, отталкивающий, приятный, противный, симпатичный, славный,...

·        Size: крупный, мелкий, широкоплечий, узкоплечий,...

·        Height: высокий, высокого роста, коренастый, невысокий, низенький, низкого роста, приземистый, рослый, среднего роста,...

·        Strength: крепкий, могучий, мускулистый, сильный, слабый, хилый,...

·        Build: атлетически сложенный, длинноногий, длинношеий, коротконогий, кривоногий, статный, стройный, сутулый, хорошо сложенный, хрупкий,...

·        Nutritional state: дородный, жирный, корпулентный, костлявый, плотный, полный, средней упитанности, сухопарый, сытый, толстый, тощий, тучный, упитанный, худой,...

·        Physical deficiency: близорукий, глухой, косой, немой, раненый, слепой, смешной, хромой,...

The given rubrics cover more than 95% of all attributive collocations for the keyword, but not all of them. Among those not yet covered, the following can be indicated:

·        Attributes of time and place (temporal and local frames) of existence of attributed keyword: древний, советский, современный, средневековый,...

·        Attributes of quantifying, determining, and demonstrative type: всякий, другой, каждый, конкретный, любой, отдельный, указанный, этот, первый, второй,... This set is small and closed. It characterizes the means of singling out the keyword in speech. We have not managed to find an ordinary name for this group.

·        Attributes forming phrasemes like снежный человек ‘yeti’. Their meaning cannot be reduced to the combination of the meanings of the constituents. In this case, the phraseme denotes not a man but a mythical ape. For other keywords, phrasemes can be more numerous.

Finally, it remains unclear how to classify, even after broadening the rubrics, attributes like мертвый ‘dead’, полуживой ‘half dead’, свободный ‘free’... Assigning a separate rubric to each small group of attributes would make the classification too uneven, so we have to assign such small items to a single rubric Miscellanies, that is, to introduce a “dump” for everything nonstandard.
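The two-level hierarchy of this section, together with the catch-all rubric Miscellanies and the coverage figure, can be sketched as a nested dictionary. This is only an illustration under stated assumptions: the dictionary holds a tiny excerpt of the lists above, and the names `RUBRICS`, `rubrics_of`, and `coverage` are hypothetical, not part of the described system.

```python
from typing import Dict, List, Set

# Two-level rubric hierarchy for one keyword (tiny excerpt of the lists above).
RUBRICS: Dict[str, Dict[str, Set[str]]] = {
    "Social features": {
        "Importance": {"крупный", "полезный"},
        "Prosperity": {"бедный", "богатый", "нищий"},
    },
    "Outward features": {
        "Age": {"молодой", "старый"},
        "Size": {"крупный", "мелкий"},
    },
}

MISCELLANIES = "Miscellanies"  # catch-all rubric for nonstandard attributes


def rubrics_of(attribute: str,
               rubrics: Dict[str, Dict[str, Set[str]]]) -> List[str]:
    """All 'Top / Sub' rubric paths listing the attribute (rubrics may intersect)."""
    paths = [f"{top} / {sub}"
             for top, subs in rubrics.items()
             for sub, members in subs.items()
             if attribute in members]
    return paths or [MISCELLANIES]


def coverage(attributes: Set[str],
             rubrics: Dict[str, Dict[str, Set[str]]]) -> float:
    """Share of the keyword's attributes that fall into some named rubric."""
    classified = {a for subs in rubrics.values()
                  for members in subs.values() for a in members}
    return len(attributes & classified) / len(attributes)
```

An attribute such as крупный is returned under two rubrics, which reflects the intersections noted above, while мертвый falls into Miscellanies.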

KEYWORDS покрытие ‘cover, coating’ and покрытия ‘covers’

The keywords покрытие and покрытия have ranks 2 and 3, respectively, in our database. In a simplified interpretation, the wordform покрытие includes the meanings of both result and process. The second meaning is not so fertile in attributes, but it is precisely the reason why the singular wordform outranks the plural one, which rarely has the process interpretation.

From the viewpoint of lexical semantics [7], покрытие as a result is the value of the lexical function Sres of the predicate покрывать ‘to cover’. This predicate has four arguments: subject, object, means, and instrument. In our database, the relevant collocations turned out to be so numerous because of the broad use of this term in technology. It is not only a term as such, but is also able to generate narrower terms by means of attributes.

In the technological use of the term покрытие, the set of usual predicate actants turned out to be reduced by the loss of the subject, but extended by additional elements taken from circonstants. These additions are the objective (purpose) of the coating and a set of qualities attending the coated product. For a non-specialist, the latter are sometimes difficult to distinguish from objectives.

Hence, the set of attributes for покрытие is empirically divided into the following groups:

Object of coating (what is covered?): автомобильное, аэродромное, дорожное, мостовое, напольное, палубное, чердачное,...

Material of coating (with what is it covered?): алмазное, алюминиевое, асфальтовое, битумное, водное, ворсистое, гравийное, графитовое, золотое, каучуковое, керамическое,...

Method of coating (in what way is it covered?): анодированное, быстросохнущее, вакуумное, обжиговое, напыленное,...

Objective of coating (why is it covered?): антиадгезивное, антибактериальное, антибликовое, антигрибковое, армирующее, атмосферостойкое, взрывобезопасное, герметизирующее, декоративное, защитное, защитно-декоративное, специальное,...

Outward or constructive property of coating (what feature accompanies the coating?): бесшовное, влагочувствительное, блестящее, временное, водонерастворимое, вспучивающееся, гибкое, гладкое, гофрированное, неровное, нестойкое, постоянное, прочное, сплошное, стандартное, стойкое, съемное, унифицированное, устойчивое, эффективное, яркое...

The quantifying, determining, and demonstrative attributes form a group of their own: любое, любые, каждое, все, отдельные, другое, указанное, это,...

KEYWORD среда ‘environment, medium, milieu’

The keyword среда1 ‘environment, medium’ has rank 4. This lexeme is used, first, as a usual word characterizing the milieu of a human created by other people. Second, it is a frequent and fertile sci-tech term for the environment or medium of lifeless objects, and in this role it can easily be classified in more detail. Accordingly, we have the following groups for среда1.

Living beings: артистическая, архитектурная, военная, враждебная, высокообразованная, гнилая, городская, затхлая, интеллигентная, культурная, мещанская, научная, рабочая, языковая,...

Lifeless objects:

·        Contents: аммиачная, аргоновая, атмосферная, атомная, аэрозольная, бактериологическая, безводная, безмасляная, белковая, бинарная, биологическая, водная, водно-органическая, водно-спиртовая, водяная, воздушная, воздушно-водяная, кислая, щелочная,...

·        Main property: абразивная, агрессивная, активная, взрывобезопасная, взрывоопасная, влажная, высококонцентрированная,...

·        Structure: аморфная, анизотропная, гетерогенная, гетерофазная, гомогенная, градиентная, двухмерная, замкнутая, неоднородная, однородная,...

·        Scope: внешняя, внутренняя, внутриклеточная, географическая, геологическая, неограниченная,...

KEYWORD вид1 ‘view’

The keyword вид1 has rank 5. At the upper level, the corresponding attributes are strictly divisible into two rubrics, relating to a living being (as a rule, a human) and to a lifeless object. The intersection of these groups is limited: внешний, городской, деревенский, жалкий, красивый, мрачный,...

The further classification of attributes for living beings is not so obvious. Below, they are divided according to the estimate of the outward appearance of the being (the effect produced by its appearance), its emotional state, and its physical (more exactly, physiological) state. It is important to note that the estimates are given to the characterized person, whereas the estimator is always an outside observer. Attributes for lifeless objects are rarer and were left without further classification.

Thus, we propose the following rubrics for attributes of вид1:

Living beings:

·        Outward effect: ангельский, аристократический, безобразный, благородный, блестящий, бродяжий, важный, величественный, внушительный, вороватый, впечатляющий, вульгарный, гадкий, глуповатый, глупый, дегенеративный, достойный, дурацкий, жалкий, жуликоватый, затрапезный, значительный, идиотский, импозантный, интеллигентный, командирский, комичный,...

·        Emotional state: безразличный, беспокойный, беспомощный, благоразумный, благодушный, блаженный, бойкий, бравый, вдумчивый, веселенький, веселый, виноватый, воинственный, враждебный, вызывающий, гадливый, глубокомысленный, горделивый, гордый, грозный, грустный, деловой,...

·        Physiological state: анемичный, болезненный, больной, возбужденный, вялый, заспанный, здоровый, изможденный, измученный, испитой, истасканный, истерзанный, молодой, моложавый, нездоровый,...

Lifeless objects: архивированный, внешний, внутренний, выгодный, главный, городской, готовый, декоративный, деревенский, дивный, дикий, дорогой, достойный, естественный, жалкий, живописный, завуалированный, запущенный, засущенный, затейливый, изолированный, изумительный, искаженный, сжатый,...

It may be noted that the attributes for lifeless objects intersect with those for living beings only in the outward effect produced, since lifeless objects cannot be assessed by their emotional or physiological state. As to outward effect and emotional state, they can be further classified along the lines defined earlier for человек ‘man, human’.
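The limited intersection between the two top-level rubrics of вид1 can be checked mechanically with set operations. The sketch below uses short excerpts of the lists above; the Jaccard index is an illustrative measure chosen here, not one used in the article, and all names are hypothetical.

```python
from typing import Dict, Set

# Excerpts of the two top-level rubrics for the keyword вид (from the lists above).
VID: Dict[str, Set[str]] = {
    "Living beings": {"ангельский", "важный", "жалкий", "достойный",
                      "веселый", "больной"},
    "Lifeless objects": {"архивированный", "внешний", "жалкий", "достойный",
                         "живописный", "сжатый"},
}


def shared_attributes(rubrics: Dict[str, Set[str]],
                      first: str, second: str) -> Set[str]:
    """Attributes listed under both rubrics."""
    return rubrics[first] & rubrics[second]


def overlap_ratio(rubrics: Dict[str, Set[str]],
                  first: str, second: str) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    a, b = rubrics[first], rubrics[second]
    return len(a & b) / len(a | b)
```

In this excerpt the shared attributes (жалкий, достойный) all belong to the outward effect subgroup of living beings, in line with the observation above.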

KEYWORD контроль ‘checking, inspection, test’

The keyword контроль has rank 6. From the viewpoint of lexical semantics, it is the name of a predicate with the following set of actants: a checking subject, an object under check (e.g., контроль входящих ‘check of incomers’), and a checked parameter (e.g., контроль на точность ‘accuracy check’). In our database, this keyword obtained such a high rank because of the wide use of this term in technology. When applied to attributes, the set of actants is slightly changed and broadened, and it is this broadened set that yields the rubrics:

Subject of check (who or what is checking?): авторский, ведомственный, врачебный, государственный, демократический, дизайнерский, диспетчерский, инспекционный, народный, рабочий,...

Object of check (and simultaneously, what parameter is checked?): бактериологический, билетный, валютный, ветеринарно-санитарный, гормональный, допинговый, допусковый,...

Objective of check (for what purpose is something checked?): антидопинговый ‘antidrug’, антимонопольный ‘antimonopoly’. This group turned out to be very small, so perhaps it ought to be united with the previous one. Note that допинговый контроль ‘drug test’ and антидопинговый контроль ‘antidrug test’ mean quite the same in Russian.

Method of check (with what or by what method is something checked?): автоматизированный, автоматический, аналитический, аппаратный, банковский, бесконтактный, биологический, вибрационный, визуальный, выборочный, групповой, дискретный, дистанционный, инструментальный, по мелочам, поканальный, ручной,...

Quality of check (how or with what quality is something checked?): аккуратный, активный, бдительный, внимательный, всесторонний, высокопроизводительный, действенный, жесткий, неослабный, постоянный,...

Place of check (where is something checked?): внутриведомственный, внутризаводской, входной, выходной, наземный, пограничный,...

KEYWORD предприятия ‘enterprises’

The keyword предприятия has rank 9. As a usual predicate, this lexeme has as an obligatory valency the product that the enterprise develops or manufactures. In sci-tech and economic texts, this word is very frequent. The analysis of its attributes reveals more precisely the potential valencies of this word in its terminological role.

The following rubrics are proposed for the attributes:

Production (purpose): авиаремонтные, авиатранспортные, авиационные, автомобильные, авторемонтные, автотранспортные, агропромышленные, алюминиевые, вагоноремонтные, конверсионные, межотраслевые, многоотраслевые,... (more than 80% of all attributes for this keyword)

Stage of production and consuming cycle for the production: лизинговые, научно-производственные, оптовые, опытные, проектные, производственные, разрабатывающие, сборочные, эксплуатационные,...

Owner: акционированные, арендные, государственно-акционерные, государственные, единоличные, зависимые, зарубежные, иностранные, кооперативные, местные, муниципальные, национализированные, отечественные, подпольные, приватизированные,...

Size: базовые, большие, градообразующие, карликовые, крупнейшие, крупные, малые, мелкие, мощные, небольшие, огромные,...

Interrelation with other enterprises: встроенные, головные, дочерние, интегрированные, подчиненные,...

Readiness to function: банкротные, вводимые, действующие, ликвидируемые, новые, проектируемые,...

Efficacy: безнадежные, безубыточные, выгодные, доходные, нерентабельные, неэффективные, низкорентабельные, образцово-показательные, отсталые, отстающие, передовые, привлекательные, прибыльные, рентабельные,...

The given rubrics are rather comprehensive, but even after taking into account quantifying words (все, всевозможные, любые, многие, многочисленные, различные...) some attributes remain unclassified: автоматизированные, опасные, фиктивные... They characterize those diverse properties of enterprises (technological, ecological, juridical, etc.) that appear in texts rather seldom. Once more, they should be placed in the rubric Miscellanies.

KEYWORDS поступки ‘actions’ and поступок ‘action’

The keywords поступки and поступок have ranks 22 and 28. From the viewpoint of lexical semantics, поступок is the name of a predicate with an abstract meaning, without a strictly defined set of actants (with the exception of the subject performing the action). As to the set of attributes, this predicate is an argument of other, assessing predicates. One of them can be defined as the correspondence of the action to the norms of the human community and of reasonable behavior. Another assessing predicate takes as its values the characteristic features of the way in which the action was performed.

Within the subgroup of attributes corresponding (or not corresponding) to these norms, an additional (though rather approximate) division according to the types of fulfilled or violated norms is possible, namely, norms of morals and law, of the everyday way of life, and of reasonable behavior. Altogether, the rubrics are chosen here as follows.

Correspondence of the action to norms of

·        moral and law: аморальный, беззаконный, безнравственный, бескорыстный, беспринципный, бессердечный, бесстыдный, бесчеловечный, бесчестный, благородный, возмутительный, героический, гуманный, добрый, достойный, жестокий, злой, лицемерный, коварный, мерзкий, красивый, моральный, мужественный, наказуемый,...

·        everyday way of life: бестактный, бесцеремонный, джентльменский, дипломатический, заурядный, естественный, мальчишеский, неджентльменский, необъяснимый, нетактичный, обыденный, тактичный,...

·        reasonable behavior: благоразумный, глупый, дикий, дурацкий, искренний, логичный, намеренный, легкомысленный, нелогичный, необдуманный, оправданный, разумный, серьезный,...

Method of performing the action: взрывной, впечатляющий, запоздалый, импульсивный, убедительный, яркий,...

Once more, the quantifying attributes form a separate group.

KEYWORDS цены ‘prices’ and цена ‘price’

The keywords цена and цены do not have such high ranks (52 for цена and 60 for цены), but they are remarkable from the viewpoint of reasonable classification.

Value (level) of prices

·        for buyer: баснословные, безбожные, безумные, божеские, вздутые, высокие, дискриминационные, доступные, крайние, недоступные, низкие, ничтожные, подходящие, сумасшедшие, сходные,...

·        for seller: конкурентоспособные, крайние, наилучшие, поощрительные, реальные, смешные, справедливые, сходные, хорошие,...

·        for detached observer: высокие, демпинговые, дискриминационные, конкурентоспособные, низкие, справедливые, средние, хорошие,...

Scope: внешнеторговые, внутрифирменные, договорные, заводские, зональные, закупочные, импортные, картельные, коммерческие, легальные, мировые, монопольные, нетто, оптовые, отпускные, подпольные, прейскурантные, расчетные, розничные, рыночные, сезонные, трансфертные,...

Variability of prices within their local and temporal scope: единые, падающие, плавающие, повышенные, пониженные, постоянные, растущие, свободные, сниженные, сопоставимые, стабильные, твердые, устойчивые, фиксированные,...

The sets of attributes for the three selected groups intersect heavily, but it is not reasonable to unite them. Indeed, only a detached observer can call given prices dumping prices, while for a buyer they could be low or reasonable, and for a seller fair, honest, or premium. We know that it is not without reason that two different notions exist: seller’s price and buyer’s price.
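Why the three viewpoints are kept apart despite their intersection can be shown with an inverted index from attribute to viewpoint. The sketch below is a minimal illustration using a short excerpt of the Russian lists above; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Excerpt of the rubric "Value (level) of prices", split by viewpoint.
PRICE_LEVEL: Dict[str, Set[str]] = {
    "buyer": {"высокие", "доступные", "низкие", "сходные"},
    "seller": {"конкурентоспособные", "справедливые", "сходные"},
    "detached observer": {"высокие", "демпинговые", "конкурентоспособные",
                          "низкие", "справедливые"},
}


def viewpoints_by_attribute(rubrics: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Inverted index: attribute -> sorted list of viewpoints that use it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for viewpoint, attributes in rubrics.items():
        for attribute in attributes:
            index[attribute].append(viewpoint)
    return {attribute: sorted(views) for attribute, views in index.items()}


def exclusive_to(viewpoint: str, rubrics: Dict[str, Set[str]]) -> Set[str]:
    """Attributes that only the given viewpoint uses."""
    others = set().union(*(attrs for name, attrs in rubrics.items()
                           if name != viewpoint))
    return rubrics[viewpoint] - others
```

In this excerpt, демпинговые is exclusive to the detached observer, while сходные is shared by buyer and seller, so merging the groups would lose the distinction discussed above.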

UNIVERSALITY OF RUBRICS

Though only eight lexemes (11 wordforms) were investigated in detail, these are very fertile nouns, and our database already contains about 4,000 attributive collocations for them in total. Let us now demonstrate that the rubrics used above for a rather limited number of words suit other words as well, within the group leading in fertility.

·        Since люди ‘people’ in Russian is the plural of человек, whereas high-rank words like женщина/женщины ‘woman/women’, ребенок/ребята ‘child/children’, мальчик/мальчики ‘boy/boys’, девочка/девочки ‘girl/girls’ differ from человек only in the parameter of sex and/or age, all rubrics for человек are immediately transferable to these wordforms as well.

·        For the keywords показатель ‘index’ and показатели ‘indexes’ with ranks 7 and 8, it is not difficult to define three rubrics, in the same way as for покрытие. They are practically semantic valencies, namely: a parameter estimated by the index (агробиологический, акустический, ананомический, антифрикционный, аэродинамический,...), the estimate of the parameter (беспрецедентный, внушительный, высокий, низкий,...), and a method of assessment (абсолютный, агрегатный, аналитический, базовый, важнейший, выходной, главный, интегральный,...).

·        For the keyword взгляд1 ‘look, glance’ with rank 11, we easily establish as rubrics the emotion expressed by the glance (безжизненный, безмятежный, безразличный, благодарный, блудливый, вожделенный,...) and the way the glance is made (бегающий, блуждающий, быстрый, внимательный, живой, застывший, искоса,...). The classification is similar to that for вид1, especially in the emotional aspect, where the attributes are mainly the same.

As we can see, the rubrics already introduced possess intra-language universality. We are far from the opinion that these rubrics are sufficient, but we are convinced of their necessity.
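The transfer of a rubric scheme from one keyword to semantically close ones can be pictured as a simple lookup through a "borrows from" table. This is a minimal sketch under stated assumptions: the table `BORROWS_FROM` and the function `scheme_for` are hypothetical and only restate the first item of the list above.

```python
from typing import Dict, List, Optional

# Top-level rubric scheme worked out for one keyword (names from this article).
SCHEMES: Dict[str, List[str]] = {
    "человек": ["Social features", "Behavior features", "Moral features",
                "Intellectual features", "Outward features"],
}

# Hypothetical table: wordforms that borrow the scheme of another keyword.
BORROWS_FROM: Dict[str, str] = {
    "люди": "человек",
    "женщина": "человек",
    "женщины": "человек",
    "ребенок": "человек",
    "мальчик": "человек",
    "девочка": "человек",
}


def scheme_for(keyword: str) -> Optional[List[str]]:
    """Rubric scheme of the keyword, own or borrowed; None if not yet defined."""
    donor = BORROWS_FROM.get(keyword, keyword)
    return SCHEMES.get(donor)
```

A keyword without an own or borrowed scheme yields `None`, which marks it as still awaiting manual rubric construction.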

Since we are dealing with semantics, it is not difficult to show the applicability of the selected rubrics to other languages as well. Let us apply, for example, the rubrics for цена/цены to their English analogs price/prices, with the corresponding English attributes.

Value

·        for buyer: attractive, bargain, dear, exorbitant, fabulous, fair, fancy, heavy, outrageous, outside, prohibitive, ransom, reasonable, smart, soaring, staggering, steep, tall, ungodly, unreasonable...

·        for seller: asked, bed-rock, best, bottom, competitive, fair, give-away, good, honest, handsome, nominal, popular, premium...

·        for detached observer: buying, discriminative, dump, extra high, good round, great, high, low, moderate, pegged...

Scope: administered, agreed, all-in, all-inclusive, asking, base, blanket, buying, carry-over, cash, ceiling, close, consumer, contract, cost, current, going, export, import, inclusive, leading, list, marked, market, net, original, prepublication, present, purchase, put-up, redemption, resale, reserve, retail, sale, scarcity, selling, set, short, spot, start, target, tender, trade, trigger, unit, upset, volume, wholesale...

Variability within the scope: determined, dropping, growing, fixed, flat, inflated, oscillating, pegged, reduced, standard, stiff...

Thus, the proposed rubrics are rather universal, both within a single language and while passing from one language to another.

DO RUBRICS FORM A UNITED SYSTEM?

Let us now consider what the introduced rubrics are from the semantic point of view.

At the upper level of the hierarchy, all the studied nouns are divided into two large semantic classes: living beings (in the overwhelming majority, humans) and lifeless entities, which can be names of predicates (among them organizations) or terms (artifacts, products, etc.).

Living beings are characterized by features of social, behavior, moral, intellectual, and physical aspects. For them, it is necessary to introduce norms of morals and law, of the everyday way of life, and of reasonable behavior. They have a current emotional and physical state, a point of view (opinion), and many other things.

For lifeless objects, the rubrics have at least the following alternatives:

·        Active semantic valencies of the given predicative noun, namely: subject (agent), object (patient), owner, production, objectives, way of functioning, material to be used, tool to be used, structure, and scope in time and space. In general, some of these roles are circonstants from the viewpoint of standard lexical semantics. However, the authors of [7] have proposed to call the typical circonstants of widespread technical terms their frame actants and to treat them in language processing just like usual semantic valencies. This approach seems quite adequate.

·        Passive semantic valencies, namely: size, efficacy in reaching the objectives, and readiness to function for organizations; the main property and other important properties for products.

·        Passive co-valencies, which can be illustrated by the connection of price with buyer and seller. All three are co-subordinated to the predicate selling, while the sets of attributes for price heavily depend on the opinions of the two other participants of the situation, as well as of a detached observer. The latter is not a direct participant of the situation, but is a potential buyer or seller and gives his own attributes from the side.

Measurable parameters are characterized by qualitative estimates of their values. Many entities have their own scope, in time and space, and the variability of these values within the scope can be assessed by words too.

Parameters that have a numeric measure or only two possible values can be placed on an axis (a scale), on which the individual attributes are different points. For example, the attributes characterizing the Prosperity of a human can be ordered, with some approximation, as нищий ‘poverty-stricken’, бедный, небогатый, из среднего класса, зажиточный, обеспеченный, состоятельный, богатый, сверхбогатый ‘superrich’.
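The idea of a rubric defined on a scale can be illustrated with a minimal sketch. It assumes that the rubric is stored simply as an ordered list and that the points are evenly spaced; both are simplifications made for this illustration, and the names PROSPERITY_SCALE, scale_position, and is_higher are hypothetical rather than part of any described system.

```python
# Illustrative sketch only: a rubric defined on a scale, stored as an
# ordered list. The order is the approximate one given in the text;
# even spacing between the points is an assumption of this sketch.
PROSPERITY_SCALE = [
    "нищий",               # 'poverty-stricken' (lowest point)
    "бедный",
    "небогатый",
    "из среднего класса",
    "зажиточный",
    "обеспеченный",
    "состоятельный",
    "богатый",
    "сверхбогатый",        # 'superrich' (highest point)
]


def scale_position(attribute, scale=PROSPERITY_SCALE):
    """Map an attribute to a relative position in [0.0, 1.0] on the scale."""
    return scale.index(attribute) / (len(scale) - 1)


def is_higher(first, second, scale=PROSPERITY_SCALE):
    """Return True if `first` lies further along the scale than `second`."""
    return scale.index(first) > scale.index(second)
```

An attribute that is absent from the list raises ValueError, which reflects the fact that only attributes belonging to the given rubric have a position on its scale.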

It is precisely the rubrics defined on scales that should contain the lexical functions introduced by I. Mel’čuk: Magn ‘large, intensive’, Bon ‘good’, and Ver ‘as should be’, as well as their antonyms AntiMagn, AntiBon, AntiVer [9]. But the fertile nouns under investigation are such multifaceted entities that defining just these axes for each of them is very difficult, if possible at all. These axes pass unnoticed among all the others. Indeed, we do not know which properties of человек ‘man’ (an ordinary word) or of the term покрытие should be selected as Magn, Bon, and Ver.

It is easy to see that the partial rubrics we have introduced do not fit the scheme of a typical sci-tech thesaurus. Indeed, it is usually artifacts (technical products) that appear there, and the main semantic relation between thesaurus entries is genus – species. The relations revealed above are much richer.

However, all these rubrics, in rather similar formulations, can be found in the most developed thesauri of natural languages, among which Roget’s should be named first [5]. It is already 150 years old, but it remains the most popular thesaurus for English, since it has been modernized several times. The hierarchy of entries and notions there is imperfect, since the set of ideas with which a human operates cannot be related by genus – species links only. Indeed, it is not clear how to include in the general hierarchy the set of abstract notions characterizing semantic roles such as subject, object, goal, method, etc.

Our rubrics were selected according to the principle of comprehensibility to a potential user, and hence ordinary words were taken for purely semantic purposes, with all their disadvantages: diffuse definitions, synonymy, and homonymy. In an application, we should accept the same disadvantages for the whole set of rubric titles.

For example, while selecting the title Moral features within the classification for humans, some vacillation is inevitable among synonymous variants such as Moral aspects or Moral characteristics. This means that such a group of synonyms, or an even broader one, with the dominant Moral features should be contained in the system, so that a user can find the corresponding rubric starting from whichever synonymous option first comes to mind.

It is rather difficult to demand that a user without any linguistic background know scientifically constructed terms. Such a user may not be able to tell which of two compared terms is broader. This means that the system needs not only synonymy groups, but also partial hierarchies formed from those groups.
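A minimal sketch of such a lookup is given below, assuming that each synonymy group is stored under its dominant title and that the partial hierarchy is stored as child-to-parent links. The parent title Human features and all function and variable names are hypothetical and serve only this illustration.

```python
# Illustrative sketch only. Synonymy groups are keyed by the dominant title.
SYNONYM_GROUPS = {
    "Moral features": {"Moral aspects", "Moral characteristics"},
}

# Partial hierarchy over dominant titles (child -> parent).
# "Human features" is a hypothetical parent title invented for this sketch.
PARENT = {
    "Moral features": "Human features",
    "Intellectual features": "Human features",
    "Physical features": "Human features",
}


def normalize(title):
    """Lower-case a title and collapse whitespace, so lookup is tolerant."""
    return " ".join(title.lower().split())


def build_title_index(groups):
    """Map every title in every synonymy group to the group's dominant."""
    index = {}
    for dominant, synonyms in groups.items():
        for title in {dominant, *synonyms}:
            index[normalize(title)] = dominant
    return index


def find_rubric(query, index):
    """Return the dominant title for any synonymous query, or None."""
    return index.get(normalize(query))


def is_broader(broader, narrower, parent=PARENT):
    """Return True if `broader` is an ancestor of `narrower`."""
    node = parent.get(narrower)
    while node is not None:
        if node == broader:
            return True
        node = parent.get(node)
    return False


TITLE_INDEX = build_title_index(SYNONYM_GROUPS)
```

With this arrangement the user never has to know which title is the dominant one or which of two terms is broader; both questions are answered by the stored groups and links.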

DISADVANTAGES OF THE CLASSIFICATION

The analysis carried out shows that the idea of splitting the set of attributive collocations of any noun into subgroups, each supplied with a subtitle motivated by a semantic element common to the group, is realizable. However, this idea has many disadvantages.

·        Creating a complete system of rubrics for all attributes is hardly easier than creating a thesaurus for the given natural language similar to that of P. Roget. However, such a task is quite realizable for the several hundred most fertile nouns (in the sense of combinability).

·        The classification of attributes has to be carried out on different grounds for different nouns. For example, the rubrics are connected with actants for one noun and with classes of properties for another.

·        It is usually not possible to make co-subordinated rubrics fully independent. A typical example is given by the rubrics Character and Temper for the keyword man. It is not clear whether these notions are intersecting synonyms, two different subrubrics of a single rubric (as we have taken them), or subrubrics of two different rubrics, Intellectual features and Physical features. As a satisfactory criterion for splitting into rubrics, one can take the admissibility of using attributes of different rubrics in parallel with the same keyword, but this criterion cannot be applied strictly and in full.

·        Even when the given rubrics seem rather independent, the groups of their attributes can intersect, i.e., the same attribute has to be placed in two or more rubrics at once. For example, the attributes high and low for prices can be equally used by the buyer, the seller, and a detached observer.

·        Only a few nouns admit a one-level (flat) classification; see контроль and покрытия. For the majority, a reasonable classification contains two or more levels. This variability seems to be irremovable, since it depends heavily on the specific noun.

·        Not all rubrics can be given humanly comprehensible names. For example, we could not find an “everyday” title for quantifying, determining, and ordinal adjectives.

·        Sometimes, attributes do not fit any reasonable rubric. A separate rubric can be allotted to each of them, or all of them can be included in a Miscellanies rubric. Corresponding examples were already given. However, the presence of such a rubric in a hierarchy is considered a classification failure.

·        The search within the united hierarchy of rubrics should admit synonymy, since selecting a single possible title for each rubric is practically impossible.
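Two of the difficulties listed above, intersecting rubrics and attributes that fit no rubric, can be made concrete with a minimal sketch. It assumes that the rubrics of one keyword are stored as sets of attributes; the rubric titles Buyer, Seller, and Detached observer and all function names are hypothetical labels chosen for this illustration.

```python
from collections import defaultdict

# Illustrative sketch only: rubrics for the keyword "price". The same
# attribute may belong to several rubrics at once.
PRICE_RUBRICS = {
    "Buyer": {"high", "low"},
    "Seller": {"high", "low"},
    "Detached observer": {"high", "low"},
}

FALLBACK_RUBRIC = "Miscellanies"


def build_inverted_index(rubrics):
    """Map each attribute to the set of rubrics that contain it."""
    index = defaultdict(set)
    for rubric, attributes in rubrics.items():
        for attribute in attributes:
            index[attribute].add(rubric)
    return index


def rubrics_of(attribute, index):
    """Return the sorted rubrics of an attribute, or the fallback rubric."""
    return sorted(index.get(attribute, {FALLBACK_RUBRIC}))


def intersecting_attributes(index):
    """Return the attributes that are placed in two or more rubrics."""
    return {attr for attr, rubrics in index.items() if len(rubrics) > 1}


PRICE_INDEX = build_inverted_index(PRICE_RUBRICS)
```

Counting how many attributes end up in the fallback rubric or in several rubrics at once gives a rough, purely practical indication of how well a given system of rubrics suits a given noun.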

We can see that the disadvantages of the proposed classification method can hardly be removed, but even in its present, imperfect version the method seems useful.

CONCLUSIONS

A method of classification of attributive collocations in databases is proposed, based on the semantic interpretation of the words to be combined.

In the practical aspect, rubrics in collocation databases speed up the search for the necessary collocation in large lists that previously could be ordered only lexicographically.

In the theoretical aspect, the proposed method of classification can be considered an additional method of linguistic portraying of words. The more numerous the collocations, the more precise the lexicographic portrait based on them.

For sci-tech terms, such portraying also helps to reveal the frame actants, i.e., empirical valencies observed with high frequency with the given predicate term in texts of the corresponding narrow subject area.

REFERENCES

1.      Benson, M., et al. The BBI Combinatory Dictionary of English. John Benjamins Publ., Amsterdam / Philadelphia, 1989.

2.      Calzolari, N., R. Bindi. Acquisition of Lexical Information from a Large Textual Italian Corpus. Proc. COLING-90, Helsinki, 1990.

3.      Bolshakov, I. A. Multifunction dictionary – thesaurus for automatic preparation of Russian texts // Nauch.-Techn. Inf., ser. 2. – 1994, N 1, P. 11-23.

4.      Bolshakov, I.A. Multifunctional Thesaurus for Russian Word Processing. Proc. 4th Conf. on Applied Natural Language Processing, Stuttgart, 1994, P. 200-202.

5.      Roget’s International Thesaurus. Fifth edition. HarperCollins Publ. 1992.

6.      Bolshakov, I. A., A. F. Gelbukh. Separate presentation of combinability of nouns in singular and plural // Proceedings of International Workshop on Computational Linguistics and its Applications Dialog'95, Kazan, Russia, 1995.

7.      Apresian, Yu. D. Lexical semantics. Synonymic means of language. 2nd ed., “Yazyki Russkoy Kultury”, Vostochnaya Literatura Publ., 1995.

8.      Tsinman, L. L., V. G. Sizov. Government pattern of a word, frame actants and linguistic engineering // Semiotika i Informatika. – 1998. – No. 36. – P. 154-166.

9.      Zholkovsky, A. K., I. A. Mel’čuk. On semantic synthesis // Problemy Kibernetiki. – 1967. V. 19. – P. 117-238.

 

 



*The preliminary version of this paper was published under the title “Рубрикация словосочетаний в базах данных по элементам толкования сочетаемых слов” (Rubrification of word combinations in the databases by elements of meaning of combined words) in the journal Nauchnaya i Tekhnicheskaya Informatsiya (NTI), Ser. 2, No. 2, 2000.