Information Extraction
Dr. Günter Neumann, DFKI GmbH

Transcript

1 What is information extraction?

With the rapid spread of the Internet, the problem of information overload comes more and more to the fore: the more texts are available online, the more difficult it becomes to exploit their information potential in a targeted manner, i.e. to find relevant information, to extract it, and to represent it in compact form. In order to cope adequately with this information overload, intensive research is already being carried out on new technologies for future intelligent information management systems. One newly established research direction is the study and implementation of systems for information extraction (IE).

The aim of IE is the construction of systems that can specifically track down and structure domain-specific information in free texts, while at the same time skipping over irrelevant information. IE systems do not attempt a comprehensive analysis of the entire content of all text documents; they should only analyze or understand those text passages that contain relevant information. What counts as relevant is given to the system by predefined domain-specific lexicon entries or rules. This knowledge must define in as much detail and as precisely as possible which types of information are to be extracted by an IE system, so that extensive and at the same time precise extraction is possible. Typically, the given information models complex, coherent response patterns regarding who, what, whom, when, where and possibly why. They are specified in the form of templates, i.e. bundles of attribute/value pairs, e.g. company and product information, sales reports, personnel changes, job postings. The core functionality of an IE system can then be briefly characterized as follows:

Input: a specification of the type of relevant information in the form of templates (a set of attributes) and a set of free text documents (press releases, Internet documents, etc.)

Output: a set of instantiated templates (values for the attributes), filled with the normalized text fragments identified as relevant.
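This input/output behavior can be sketched in program form roughly as follows (a minimal Python sketch of our own; the names Template and extract are illustrative and do not refer to any concrete system):

from typing import Optional

# A template instance: a fixed set of attributes whose values are
# normalized text fragments, or None if an attribute stays unfilled.
Template = dict[str, Optional[str]]

def extract(attributes: list[str], documents: list[str]) -> list[Template]:
    """Input: the type of relevant information (a set of attributes)
    and free text documents; output: instantiated templates."""
    raise NotImplementedError  # stands in for a concrete IE system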

The data extracted in this way can be used in many ways, e.g. for fine-grained text filtering or classification, as entries for databases, to support text mining and answer extraction systems, or as a starting point for text summarization.

2 An example

The following example illustrates what has just been explained. We consider the task of extracting information about personnel changes from online documents. In particular, we want to know which person (PersonOut)¹ left which position in which organization and when (TimeOut), and which new person (PersonIn) took up this position again and when (TimeIn). The corresponding template has the following form:

[PersonOut PersonIn Position Organization TimeOut TimeIn]

¹ The attributes are highlighted in italics.

For the following text:

Dr. Hermann Wirth, previous director of the Munich Music Academy, said goodbye today. The 65-year-old is taking his well-deserved retirement. Sabine Klinger was named as his successor. The position of music director has also been filled. Annelie Häfner succeeds Christian Meindl.

the result is then, for example, the filled (instantiated) template:

PersonOut: Dr. Hermann Wirth
PersonIn: Sabine Klinger
Position: Director
Organization: Musikhochschule München
TimeOut: today
TimeIn:

Note that the entire information is distributed over two sentences, and in the second relevant sentence an anaphoric expression even has to be resolved (his successor). If we assume sentence-oriented processing (see Section 4), in which the relevant information is first determined per sentence, then the various parts must be put together in a subsequent processing step (see Section 4). In addition, the concrete value for the TimeOut attribute is only given relatively in the text (today). The value for the TimeIn attribute is not mentioned at all, so it could only be derived from additional, possibly very specific assumptions, which we do not want to make for this example. Since not all attributes can be assigned values, one also speaks of a partial instance.
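In program form, using the Template sketch from Section 1, this partial instance might be represented as follows (our own illustration):

template_1: Template = {
    "PersonOut": "Dr. Hermann Wirth",
    "PersonIn": "Sabine Klinger",
    "Position": "Director",
    "Organization": "Musikhochschule München",
    "TimeOut": "today",  # only given relative to the document date
    "TimeIn": None,      # not mentioned in the text: a partial instance
}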

Incidentally, the sample text also contains a second template instance:

PersonOut: Christian Meindl
PersonIn: Annelie Häfner
Position: Music Director
Organization: Musikhochschule München
TimeOut:
TimeIn:

Note that here we use the same value for the Organization attribute as in the first instance. This is only possible if very specific rules are defined. Since such rules exploit information from the immediate vicinity (e.g. take the value of the Organization attribute from the template filled last), they are generally prone to errors. This also applies to the assignment of the two time attributes, which remain unspecified in our example. Incidentally, it would also be possible for individual attributes to have their own template structure, which would then be filled accordingly. For example, a normalized form for personal names might be: [Last Name First Name Title].

3 Evaluation criteria for IE

The quality of an IE system is assessed along the dimensions of precision and completeness.² As part of the Message Understanding Conference (MUC) series, IE systems are even systematically evaluated on a competitive basis, MUC98. The precision P denotes the proportion of correctly acquired knowledge units (KUs; templates or individual attribute/value pairs) among all KUs found. High precision therefore means that almost all KUs found are relevant. The completeness V (recall) denotes the proportion of correctly acquired KUs among all recoverable KUs. High completeness therefore means that almost all relevant KUs have been extracted.

It is difficult to optimize both measures at the same time: if a system is optimized for high precision, the probability increases that possibly relevant knowledge units will not be recognized. If, on the other hand, completeness is optimized, the risk increases that irrelevant knowledge units are included in the result. In order to obtain a single overall measure of the quality of the IE process, the F-measure was defined (usually β = 1 is set in the equation):³

F = ((β² + 1) · P · V) / (β² · P + V)

² The measures mentioned are also used in other applications, such as information retrieval and text classification.
³ Essentially, the formula defines a harmonic mean that can be weighted using the parameter β. The deviation of β from 1 determines whether P or V is weighted more heavily.
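As a small worked illustration (our own), the F-measure can be computed as follows; the example numbers are those of the best scenario-template system quoted below (V = 47%, P = 70%):

def f_measure(p: float, v: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision P and completeness (recall) V."""
    b2 = beta * beta
    return (b2 + 1) * p * v / (b2 * p + v)

print(f_measure(0.70, 0.47))  # ≈ 0.562, i.e. F ≈ 56%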

The quality of the result depends heavily on the difficulty of the task. GS96 reported in the context of MUC-6 that for a simple task such as name recognition, most systems achieved values of over 90% for both completeness and precision (the best system achieved V = 96%, P = 97%). The most difficult task was filling scenario templates, i.e. templates that describe a specific scenario, such as a terrorist attack or the example in Section 2. Here the results were V = 40 to 50% and P = 60 to 70% (the best system achieved V = 47%, P = 70%). AI97 show that for complex tasks the quality of the result is usually no better than 60% with regard to the F-measure. However, this is not as bad as it sounds, because humans are also unable to achieve 100% completeness and precision when analyzing complex texts. A person achieves an F-measure of about 97% when recognizing proper names (see MUC98).

4 A generic IE system

The example in Section 2 shows that a number of linguistic subtasks have to be solved, e.g. lexical analysis, name recognition, partial syntactic analysis, reference resolution. In principle, it would be possible to use a generic text analysis system for this purpose, one that carries out a complete, in-depth analysis of the meaning of an entire text. But even if it were possible to formalize the complete grammar (including the lexicon) of a language and to represent it in such a system, the system would still need a high degree of robustness and efficiency in order to process large amounts of free text. In order to keep pace with the enormously increasing demand for language technology triggered by the Internet revolution, researchers in the field of IE are therefore working on methods that represent a compromise between theoretical and pragmatic requirements. This has led to the development of shallow text processing methods. Here, certain generic language regularities that are known to cause complexity problems are either not treated at all or treated very pragmatically, e.g. by limiting the recursion depth on the basis of a corpus analysis or by using heuristics (prefer the longest possible matches). This engineering perspective on language processing has led to a renaissance and further development of well-known technologies, in particular of finite state machines and transducers in syntactic analysis, RS96. In contrast to finite automata (which allow regular patterns to be recognized in an input sequence), transducers also define a relation between possible sequences of input symbols (e.g. words and their assigned lexical features) and assigned output structures (e.g. phrases in the form of a dependency structure). Although the syntax of a natural language is at least context-free, current work in the field of IE shows that very practicable systems can be implemented even with these simpler transducers.
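The idea can be illustrated with a toy transducer of our own, realized here with a regular expression over part-of-speech tags: the input is a tag sequence, the output is the same sequence with recognized noun phrases bracketed as output structures.

import re

# Toy "transducer": maps a sequence of POS tags to a bracketed output
# structure, recognizing simple non-recursive noun phrases
# (optional determiner, any number of adjectives, one noun).
NP = re.compile(r"(DET )?(ADJ )*NOUN")

def np_transducer(tags: list[str]) -> str:
    text = " ".join(tags)
    return NP.sub(lambda m: "[NP " + m.group(0) + "]", text)

print(np_transducer(["DET", "ADJ", "NOUN", "VERB", "NOUN"]))
# [NP DET ADJ NOUN] VERB [NP NOUN]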

Transducers can be cascaded, with one transducer operating on another's output tape. This allows a fine modularization of a system into different components:

Token scanner: On the basis of regular expressions, this component identifies the text structure (e.g. paragraphs, indentations, title lines) and special character strings (tokens), e.g. dates and times, abbreviations, word boundaries and punctuation marks. HTML or XML parsers can also be used here to analyze appropriately marked-up texts.

Lexical analysis: In lexical processing, a morphological analysis of the potential word forms takes place, i.e. the determination of the part of speech (POS) and the inflected form (e.g. plural or singular). Especially for processing German, an analysis of compound words and of hyphen coordination (as in German "An- und Verkauf", buying and selling) must be carried out at this level, since compound formation in German is very productive, NBB97. Morphosyntactically ambiguous words (as in German "ich meine meine Tasche", "I mean my bag", where "meine" is both a verb form and a possessive pronoun) are then disambiguated using POS tagging.

Proper name recognition: This component finds and normalizes special expressions such as person, company and product names as well as complex date, time and measure expressions. Since proper names are also very productive, both special proper name lists and finite-state grammars (for handling non-lexicalized expressions) are used. At this level it is important to deal with coreferences between proper names (NE coreference) in order to determine that, e.g., Federal Chancellor Schröder, G. Schröder and Schröder designate the same person in a text (see PN00).

Parsing: As already explained in the introduction, most IE systems do not perform a full syntactic analysis, but a flat, fragmentary analysis that can be implemented efficiently with finite automata and unambiguous grammars. The parsing task is strongly modularized through an explicit separation into phrase structure (NP, PP, VG) and sentence structure. This enables a strictly bottom-up, cascaded parsing strategy (also known as chunk parsing), whereby simple, non-recursive phrases are recognized first and then combined into more complex units in a next phase (e.g. NP coordination). This very modular approach allows a simple but domain-independent phrase analysis to be combined with very domain-specific rules for recognizing complex (sentence-level) units.

Coreference resolution: The main task is to determine whether different linguistic objects refer to the same template instance. In the context of IE, the following coreference problems have to be solved: 1) NE coreference (see the paragraph on proper name recognition), 2) pronominal reference, i.e. references between pronouns (he, she, etc.), proper names and NPs, 3) references between designators (the company, the Detroit automaker) and other entities (Ford). As with shallow parsing strategies, it has been shown, especially in IE research, that shallow reference resolution (which does not require complete syntactic processing) already gives very good results.

Detection of domain-relevant patterns: This is the critical part of an IE system, as this is where the rules are defined that determine the structure of template instances. They build directly on the components mentioned above. In principle, they carry out a domain-specific collection of information according to the template structure known to the IE system. The rules are defined in such a way that they check features of the head elements of the extracted phrases (e.g. syntactic properties, entries in the domain lexicon). An example of such a rule can be found in the next section. Most of these rules are sentence-based, i.e. they provide (partial) template instances that refer only to information within a single (possibly very long) sentence.

Template unification: Clearly, a single sentence need not contain all the information necessary to instantiate a template; the information can be distributed over several sentences (see Section 2). It is therefore necessary to combine information from different template instances. In general, this is a very difficult task, which in IE systems is usually handled using a simple template unification strategy: if two templates have identical information in at least one attribute, then the entire information of both templates is combined using unification. For every two type-compatible attributes, a check is made as to whether a coreference relationship exists, whether they are semantically compatible (via the domain lexicon) or whether they are in a subsumption relationship. In addition, very application-specific heuristics are used.
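The cascaded architecture, including a simple template unification step of the kind just described, might be skeletonized as follows (our own schematic sketch; all component names are illustrative stubs, and only the unification step is spelled out):

from typing import Optional

def token_scanner(text): ...        # text structure, tokens, dates
def lexical_analysis(tokens): ...   # POS, inflection, compound analysis
def name_recognition(lexemes): ...  # find and normalize proper names
def chunk_parser(units): ...        # NP/PP/VG chunks, then clause level
def coreference(chunks): ...        # shallow reference resolution
def pattern_rules(objects): ...     # domain rules -> partial templates

def unify(t1: dict, t2: dict) -> Optional[dict]:
    """Combine two partial template instances if they share at least
    one identical filled attribute and conflict on none (simplified)."""
    if not any(v is not None and t2.get(a) == v for a, v in t1.items()):
        return None  # no common information: do not merge
    merged = {}
    for a in t1.keys() | t2.keys():
        v1, v2 = t1.get(a), t2.get(a)
        if v1 is not None and v2 is not None and v1 != v2:
            return None  # incompatible values: unification fails
        merged[a] = v1 if v1 is not None else v2
    return merged

def analyze(text):
    result = text
    for stage in (token_scanner, lexical_analysis, name_recognition,
                  chunk_parser, coreference, pattern_rules):
        result = stage(result)
    return result  # followed by pairwise template unification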

IE systems for processing English texts still dominate. In recent years, however, IE systems have also been implemented for other languages, e.g. Chinese, Japanese, Italian and German. Language-specific aspects make adjustments or extensions to the generic architecture necessary. When processing free German texts, in addition to the complex compound analysis, purely bottom-up parsing is unsuitable because of the free word order; a mixed top-down/bottom-up strategy is more promising, NBP00.

5 Machine learning methods for IE

Since the last two components mentioned are very domain-specific, they have to be adapted for every new task and domain. This is usually very time-consuming and typically has to be carried out by experts. The development of machine learning methods that automate the process of domain modeling as far as possible is therefore a focus of current IE research. To conclude this article, we briefly discuss the current state of this research. The majority of current approaches are variants of supervised inductive learning: based on a training set of text documents already annotated with the desired results, the aim is to automatically induce rules for filling templates. To do this, the template instances given with the training set are generalized so that they can also be applied to new, non-annotated documents.
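This generalization step can be illustrated schematically (our own toy sketch, not any particular published algorithm): two induced rules that differ in exactly one slot are merged by taking the union of that slot's values, which is one way a verb disjunction like the one in the rule below can arise.

# Toy generalization: merge two induced rules that differ in exactly
# one slot by forming the union there (slots that may generalize are
# represented as sets). Purely illustrative.
def generalize(r1: dict, r2: dict):
    if set(r1) != set(r2):
        return None
    diff = [a for a in r1 if r1[a] != r2[a]]
    if len(diff) != 1:
        return None  # too dissimilar: keep both rules separate
    merged = dict(r1)
    merged[diff[0]] = r1[diff[0]] | r2[diff[0]]
    return merged

r1 = {"verb": {"appointed"}, "subj": "person", "obj": "title"}
r2 = {"verb": {"elected"},   "subj": "person", "obj": "title"}
print(generalize(r1, r2))
# {'verb': {'appointed', 'elected'}, 'subj': 'person', 'obj': 'title'}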

In order to arrive at these domain-specific rules, the documents are preprocessed with the shallow processing components.⁴

⁴ This clearly shows the advantage of having domain-independent linguistic modules available. Current methods use different subsets of these modules: Fre98 uses only tokenization, CM98 additionally POS tagging, Huf96 phrase recognition, and Ril96 shallow sentence analysis.

Most current methods learn rules for filling individual attributes; newer approaches even learn rules for filling entire templates, CM98. The following example is intended to give some insight into how an inductive IE learning method works (taken from Huf96). The starting point is the following annotated training example:

<png>Sue Smith</png>, 39, of Menlo Park, was appointed <tng>president</tng> of <cng>Foo Inc.</cng>

where png stands for person name group, tng for title name group and cng for company name group. Together with the linguistic preprocessing, the following Prolog-like template rule can be derived:

noun-group(png, head(isa(person))),
noun-group(tng, head(isa(title))),
noun-group(cng, head(isa(company))),
prep(prep, head(of or at or by)),
verb-group(vg, type(passive), head(named or elected or appointed)),
subject(png, vg),
object(vg, tng),
post-nominal-prep(tng, prep),
prep-obj(prep, cng)
=> management-appointment(person(png), title(tng), company(cng)).

This rule consists of two parts: the condition part on the left of the arrow and the template instantiation part on the right. The starting point for determining the condition part is the result of the linguistic preprocessing of the training set, for which a phrase analysis is carried out. In the present case, the first NP is assigned the attribute png if its head element belongs to the semantic class person. The properties of the verb group are adopted directly into the rule. We have assumed in this example that the same rule had already been derived from training examples with the verbs elected and named, so that the three rules have been merged into one. The same applies to the prepositions mentioned (of, at, by).
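How such a rule is applied to preprocessed text can be sketched as follows (our own simplified illustration; the data structures do not correspond to Huffman's actual system):

# Simplified application of the rule above: check the condition part
# against the phrases and grammatical relations of a preprocessed
# sentence; on success, instantiate the template.
def apply_rule(s: dict):
    ok = (s["png"]["isa"] == "person"
          and s["tng"]["isa"] == "title"
          and s["cng"]["isa"] == "company"
          and s["vg"]["type"] == "passive"
          and s["vg"]["head"] in {"named", "elected", "appointed"}
          and s["subject"] == ("png", "vg")
          and s["object"] == ("vg", "tng"))
    if not ok:
        return None
    return {"person": s["png"]["text"], "title": s["tng"]["text"],
            "company": s["cng"]["text"]}

sentence = {
    "png": {"isa": "person", "text": "Sue Smith"},
    "tng": {"isa": "title", "text": "president"},
    "cng": {"isa": "company", "text": "Foo Inc."},
    "vg": {"type": "passive", "head": "appointed"},
    "subject": ("png", "vg"),
    "object": ("vg", "tng"),
}
print(apply_rule(sentence))
# {'person': 'Sue Smith', 'title': 'president', 'company': 'Foo Inc.'}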

The current methods already show surprisingly good results. Huf96 reports an F-measure of 85.2% for his system. For an IE application in the area of online job offers, CM98 report P = 87.1% and V = 58.8%. Very good results are also reported for the multilingual recognition of proper names, Gal96; BMSW97. However, most current methods still require a large amount of annotated training material, the creation of which is very time-consuming. Newer methods therefore try to derive template rules even without annotated training material, Ril96; YGTH00.

Literature

AI97: Appelt, D. and D. Israel: Building Information Extraction Systems. Tutorial at the 5th ANLP, Washington, 1997 (appelt/ietutorial/).
BMSW97: Bikel, D. M., S. Miller, R. Schwartz and R. Weischedel: Nymble: a High-Performance Learning Name-finder. In: Proceedings of the 5th ANLP, Washington, USA, March 1997.
CM98: Califf, M. and R. Mooney: Relational Learning of Pattern-Match Rules for Information Extraction. In: Proceedings of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, 1998.
Fre98: Freitag, D.: Information Extraction From HTML: Application of a General Learning Approach. In: Proceedings of the 15th AAAI, 1998.
Gal96: Gallippi, A.: Learning to Recognize Names Across Languages. In: Proceedings of the 34th ACL, Santa Cruz, California, USA, 1996.
GS96: Grishman, R. and B. Sundheim: Message Understanding Conference 6: A Brief History. In: Proceedings of the 16th COLING, Copenhagen, Denmark, 1996.
Huf96: Huffman, S.: Learning Information Extraction Patterns from Examples. In: Wermter, S., E. Riloff and G. Scheler (eds.): Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Volume 1040 of the LNAI series, Berlin, Springer, 1996.
MUC98: Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, VA, April 1998. Morgan Kaufmann Publishers.
NBB97: Neumann, G., R. Backofen, J. Baur, M. Becker and C. Braun: An Information Extraction Core System for Real World German Text Processing. In: Proceedings of the 5th ANLP, Washington, USA, March 1997.
NBP00: Neumann, G., C. Braun and J. Piskorski: A Divide-and-Conquer Strategy for Shallow Parsing of German Free Texts. In: Proceedings of the 6th ANLP, Seattle, USA, April 2000.
PN00: Piskorski, J. and G. Neumann: An Intelligent Text Extraction and Navigation System. In: Proceedings of the 6th RIAO, Paris, April 2000.
Ril96: Riloff, E.: Automatically Generating Extraction Patterns from Untagged Text. In: Proceedings of the 13th AAAI, 1996.
RS96: Roche, E. and Y. Schabes: Finite State Devices for Natural Language Processing. MIT Press, Cambridge, MA.
YGTH00: Yangarber, R., R. Grishman, P. Tapanainen and S. Huttunen: Unsupervised Discovery of Scenario-Level Patterns for Information Extraction. In: Proceedings of the 6th ANLP, Seattle, USA, April 2000.