ppt_cd.pptx: PPT on the phases of a compiler (JNTUK syllabus)

About This Presentation

Compiler design phases, with an explanation of all six phases.


Slide Content

A hearty welcome to III CSE students

Compiler Design, III-I Sem. By Padmaja Grandhe, Associate Professor, CSE Department, PSCMRCET

Objective: Understand the basic concepts of compiler design and its different phases, which will be helpful in constructing new tools like LEX, YACC, etc.

TEXT BOOKS: 1. Compilers: Principles, Techniques and Tools, Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, 2nd edition, Pearson, 2007. 2. Compiler Design, K. Muneeswaran, Oxford. 3. Principles of Compiler Design, 2nd edition, Nandhini Prasad, Elsevier.

REFERENCE BOOKS: 1. Compiler Construction: Principles and Practice, Kenneth C. Louden, Cengage. 2. Implementations of Compiler: A New Approach to Compilers Including the Algebraic Methods, Yunlin Su, Springer.

UNIT I: Introduction: Language Processing, Structure of a Compiler, The Evolution of Programming Languages, The Science of Building a Compiler, Applications of Compiler Technology. Lexical Analysis: The Role of the Lexical Analyzer, Input Buffering, Specification of Tokens, Recognition of Tokens, The Lexical-Analyzer Generator Lex.

UNIT II: Syntax Analysis: The Role of the Parser, Context-Free Grammars, Writing a Grammar, Top-Down Parsing, Bottom-Up Parsing, Introduction to LR Parsers.

UNIT III: More Powerful LR Parsers (LR(1), LALR), Using Ambiguous Grammars, Error Recovery in LR Parsers. Syntax-Directed Translation: Definitions, Evaluation Orders for SDTs, Applications of SDTs, Syntax-Directed Translation Schemes.

UNIT IV: Intermediate Code Generation: Variants of Syntax Trees, Three-Address Code, Types and Declarations, Translation of Expressions, Type Checking, Control Flow, Backpatching.

UNIT V: Code Generation: Stack Allocation of Space, Access to Non-Local Data on the Stack, Heap Management; Issues in the Design of a Code Generator, The Target Language, Addresses in the Target Code, Basic Blocks and Flow Graphs, A Simple Code Generator.

UNIT VI: Code Optimization: Machine-Independent Optimization, The Principal Sources of Optimization, Peephole Optimization, Introduction to Data-Flow Analysis.

1. Language Translator: a program that translates one language into another language. Ex: compiler, assembler, interpreter, linker, loader and preprocessor.

Language processing System

1.1 Preprocessor: A preprocessor produces input to compilers. Its tasks include macro processing, file inclusion (copying header files into the program text), and language extensions (adding capabilities to the language through built-in macros such as #define and #include). A small illustration follows.
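
As a small illustration (mine, not from the slides), the C fragment below shows the three preprocessor tasks: file inclusion, a simple macro, and a parameterized macro, all expanded before the compiler proper sees the code.

#include <stdio.h>              /* file inclusion: the text of stdio.h is copied in  */
#define PI 3.14159              /* macro processing: every PI is replaced by 3.14159 */
#define SQUARE(x) ((x) * (x))   /* parameterized macro                               */

int main(void)
{
    double r = 2.0;
    /* after preprocessing this line reads: printf("%f\n", 3.14159 * ((r) * (r))); */
    printf("%f\n", PI * SQUARE(r));
    return 0;
}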

1.2 Compiler: software that converts a program written in a high-level language (the source language) into a low-level language (object/target/machine language) and reports errors in the source program. Ex: compilers for Prolog, C, ML, LISP.

1.3 Assembler: An assembler is a program that translates software written in assembly language into machine language.

1.4 Link-Editor (Linker) & Loader: Usually a longer program is divided into smaller subprograms called modules, and these modules must be combined to execute the program. The process of combining the modules is done by the linker. High-level languages also provide built-in header files or libraries; these are predefined and contain basic functions essential for executing the program. These functions are linked into the program by the linker.

1.4 Link-Editor (Linker) & Loader contd.: The loader is responsible for loading programs and libraries. It is one of the essential stages in executing a program, as it places programs into memory and prepares them for execution. The final output is absolute machine code, which is loaded into a fixed storage location in memory and may not be relocated during execution of the program.

1.5 Examples of Compilers: 1. Ada compilers 2. ALGOL compilers 3. BASIC compilers 4. C# compilers 5. C compilers 6. C++ compilers 7. COBOL compilers 8. Common Lisp compilers 9. Fortran compilers 10. Pascal compilers 11. PL/I compilers 12. Java compilers

1.6Interpreter & Compiler

Short answer questions: What is the purpose of the loader/linker in language processing? List any 4 compilers and 2 interpreters you know. What is a preprocessor? Mention its objectives. List the differences between an interpreter and a compiler.

1.6 Interpreter and Compiler: Differences
Compiler: converts the entire source code of a program into executable machine code for a CPU. Interpreter: takes a source program and runs it line by line, translating each line as it comes to it.
Compiler: takes a large amount of time to analyze the entire source code, but the overall execution time of the program is comparatively faster. Interpreter: takes less time to analyze the source code, but the overall execution time of the program is slower.
Compiler: generates error messages only after scanning the whole program, so debugging is comparatively hard, since the error can be anywhere in the program. Interpreter: debugging is easier, since translation proceeds line by line until the error is met.
Compiler: generates intermediate object code. Interpreter: no intermediate object code is generated.
Compiled languages: C, C++, ML, LISP. Interpreted languages: Python, Perl.

Questions given in university end exams: Describe the functionality of compilers in a typical language-processing system. What are program translators? Explain.

2. STRUCTURE OF THE COMPILER / PHASES OF A COMPILER: A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. Compilation has two parts: the Analysis phase or front end (machine independent / language dependent) and the Synthesis phase or back end (machine dependent / language independent).

(Diagram: the analysis phase (front end) feeds the synthesis phase (back end).)

2.1 Lexical Analysis or Scanning: Lexical analysis is performed by the lexical analyzer, or scanner. It is the first phase of the compiler: it takes the source program as input and produces tokens as output. It reads the characters one by one, from left to right, and forms tokens. Token: a logically cohesive sequence of characters, e.g. keywords, operators, identifiers, special symbols. Example: in a + b = 20, the tokens are a, b, +, = and 20.

2.1 Lexical Analysis contd.: The group of characters forming a token is called the lexeme. The lexical analyzer not only generates a token but also enters the lexeme and its information into the symbol table if it is not already there. Example: Num1=900 (see the sketch below).
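
A minimal sketch (illustrative only; the token names are my own) of how a scanner might represent the three tokens of Num1=900: the identifier token carries an index into the symbol table, while the number token carries its value.

#include <stdio.h>

enum TokenName { TOK_ID, TOK_ASSIGN, TOK_NUM };   /* assumed token names */

struct Token {
    enum TokenName name;
    int attr;     /* symbol-table index for TOK_ID, numeric value for TOK_NUM */
};

int main(void)
{
    /* Num1=900 yields three tokens; the lexeme "Num1" is also entered
       into the symbol table (say at index 1). */
    struct Token stream[] = {
        { TOK_ID,     1   },   /* lexeme "Num1" -> <id, 1>    */
        { TOK_ASSIGN, 0   },   /* lexeme "="                  */
        { TOK_NUM,    900 }    /* lexeme "900"  -> <num, 900> */
    };
    printf("%d tokens produced\n", (int)(sizeof stream / sizeof stream[0]));
    return 0;
}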

2.2 Syntax Analysis: performed by the syntax analyzer, also known as the parser. It is the second phase of the compiler. It gets the token stream as input from the lexical analyzer and analyses the syntactic structure of the input using the specified grammar rules. It reports syntax errors, if any; otherwise it generates a parse tree or syntax tree as output. Syntax tree: a tree in which interior nodes are operators and leaf nodes are operands.

2.2 Syntax Analysis contd.: Example: for id1 = id2 + id3 * 2 (a = b + c * 2), the syntax tree has = at the root, with <id,1> as its left child and + as its right child; the + node has <id,2> on the left and a * node on the right, whose children are <id,3> and 2.

2.3 Semantic Analysis: the third phase of the compiler, performed by the semantic analyzer. It makes sure that the declarations and statements of the program are semantically correct, and reports semantic errors (e.g. c = 0; a = b / c;). The semantic analyzer uses the syntax tree and the symbol table to check whether the given program is semantically consistent with the language definition. It gathers type information and stores it in either the syntax tree or the symbol table.

2.3 Semantic Analysis contd.: Examples of semantic errors: type mismatch, undeclared variables, reserved-identifier misuse. Functions of semantic analysis: Type checking ensures that data types are used in a way consistent with their definitions. Label checking verifies that label references are valid. Flow-control checks ensure that control structures are used properly (for example, no break statement outside a loop). Implicit type conversion, or coercion: in a = b + c * 60, if a, b and c are float, 60 is automatically converted to 60.0 (see the example below).
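
A short C example (mine, for illustration) of the coercion just described: the integer constant 60 is implicitly converted to a floating-point value before the multiplication.

#include <stdio.h>

int main(void)
{
    float b = 1.5f, c = 2.5f;
    /* 60 is an int; the compiler implicitly converts it to 60.0
       (the inttofloat conversion of the slides) so the multiplication
       is carried out in floating point. */
    float a = b + c * 60;
    printf("%f\n", a);    /* prints 151.500000 */
    return 0;
}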

(Annotated syntax tree after semantic analysis: the same tree as above, with the integer constant wrapped in an inttofloat conversion node below the * operator.)

2.4 Intermediate Code Generation: performed by the intermediate code generator. In the analysis-synthesis model of a compiler, the front end translates the source program into a machine-independent intermediate code, and the back end then uses this intermediate code to generate the target code. Commonly used intermediate representations are three-address code and postfix notation.

2.4 Intermediate Code Generation contd.: A statement involving no more than three references (two for operands and one for the result) is known as a three-address statement. A sequence of three-address statements is known as three-address code. Ex: three-address code for a = b + c * 60:
T1 = inttofloat(60)
T2 = <id3> * T1
T3 = <id2> + T2
<id1> = T3
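
Three-address statements are often stored as quadruples (operator, two operands, result). The sketch below is my illustration, not the slides' code; it simply records the four statements above in that form and prints them.

#include <stdio.h>

/* One three-address statement: result = arg1 op arg2 (assumed layout). */
struct Quad {
    const char *op;
    const char *arg1;
    const char *arg2;     /* empty when the operator is unary, e.g. inttofloat */
    const char *result;
};

int main(void)
{
    struct Quad code[] = {
        { "inttofloat", "60",  "",   "T1"  },
        { "*",          "id3", "T1", "T2"  },
        { "+",          "id2", "T2", "T3"  },
        { "=",          "T3",  "",   "id1" },
    };
    for (unsigned i = 0; i < sizeof code / sizeof code[0]; i++)
        printf("%-4s := %-10s %-4s %-4s\n",
               code[i].result, code[i].op, code[i].arg1, code[i].arg2);
    return 0;
}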

2.5 Code Optimization: the fifth phase of the compiler. It gets the intermediate code as input and produces optimized intermediate code as output. This phase removes redundant code and attempts to improve the intermediate code so that faster-running machine code will result. Code optimization does not change the result of the program. To improve the generated code, optimization involves: detection and removal of dead (unreachable) code; evaluation of constant expressions and terms; collapsing of repeated (common) subexpressions into temporaries; loop unrolling; moving loop-invariant code outside the loop; removal of unwanted temporary variables.
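
The C fragment below (illustrative) shows two of the listed transformations: the constant expression 10 * 6 is folded at compile time, and the loop-invariant computation x + 60 is moved outside the loop.

/* Before optimization (as written by the programmer): */
int sum_before(int n, int x)
{
    int limit = 10 * 6;           /* constant expression                   */
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i * (x + limit); /* x + limit is recomputed on every pass */
    return total;
}

/* After optimization (what an optimizer might effectively produce): */
int sum_after(int n, int x)
{
    int total = 0;
    int t = x + 60;               /* 10 * 6 folded to 60; invariant moved out */
    for (int i = 0; i < n; i++)
        total += i * t;
    return total;
}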

2.6 Code Generation: the final phase of the compiler. It takes input from the code optimization phase and produces the target code or object code as its result. Intermediate instructions are translated into a sequence of machine instructions that perform the same task. Code generation involves: allocation of registers and memory; generation of correct references; generation of correct data types; generation of missing code.

2.7 Symbol Table Management (or Book-keeping): The symbol table is used to store all the information about the identifiers used in the program. It is a data structure containing a record for each identifier, with fields for the attributes of the identifier. It allows the compiler to find the record for each identifier quickly and to store or retrieve data from that record. Whenever an identifier is detected in any of the phases, it is stored in the symbol table. A sketch of such a record follows.
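
A minimal sketch of a symbol-table record with a linear lookup; the field names are my assumptions, and real compilers usually use a hash table for faster lookup.

#include <stdio.h>
#include <string.h>

#define MAX_SYMBOLS 100

/* One record per identifier, with fields for its attributes. */
struct Symbol {
    char name[32];     /* the lexeme                      */
    char type[16];     /* e.g. "int", "float"             */
    int  scope;        /* nesting level where declared    */
    int  offset;       /* storage offset, filled in later */
};

static struct Symbol table[MAX_SYMBOLS];
static int n_symbols = 0;

/* Return the index of the identifier, inserting it if it is new. */
int lookup_or_insert(const char *name)
{
    for (int i = 0; i < n_symbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    strncpy(table[n_symbols].name, name, sizeof table[n_symbols].name - 1);
    return n_symbols++;
}

int main(void)
{
    printf("%d %d %d\n", lookup_or_insert("position"),
                         lookup_or_insert("rate"),
                         lookup_or_insert("position"));   /* prints 0 1 0 */
    return 0;
}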

2.8 Error Handlers: Each phase can encounter errors. After detecting an error, a phase must handle it so that compilation can proceed. In lexical analysis, errors occur in the separation of tokens. In syntax analysis, errors occur during construction of the syntax tree. In semantic analysis, errors occur when the compiler detects constructs with the right syntactic structure but no meaning, and during type conversion. In code optimization, errors occur when the result is affected by the optimization. In code generation, errors occur when, for example, code is missing.

position = initial + rate * 60. Output of the lexical analyzer: <id1> = <id2> + <id3> * 60. Output of the syntax analyzer: the syntax tree with = at the root, <id1> on the left and + on the right; the + node has <id2> and a * node whose children are <id3> and 60.

Output of the semantic analyzer: the same tree, with 60 wrapped in an inttofloat conversion node. Output of intermediate code generation:
T1 = inttofloat(60)
T2 = id3 * T1
T3 = id2 + T2
id1 = T3

Output of the code optimizer: the intermediate code
T1 = inttofloat(60)
T2 = <id3> * T1
T3 = <id2> + T2
<id1> = T3
is reduced to
T1 = <id3> * 60.0
<id1> = <id2> + T1

Output of the code optimizer: T1 = id3 * 60.0; id1 = id2 + T1. Output of the code generator (LDF destination, source loads the given floating-point source value into the destination; STF destination, source stores the given source into the destination; ADDF result, op1, op2 computes result = op1 + op2; operands should be in registers, or given directly as values, for all operations except load and store):
LDF R1, id2
LDF R2, id3
MULF R2, R2, 60.0
ADDF R1, R1, R2
STF id1, R1

Assignment: The process of converting infix to postfix notation. Differentiate between token, lexeme and pattern with examples. What happens in the Analysis and Synthesis phases of compilation? What is the key difference between lexical analysis and parsing?

Pattern: a rule associated with a token that describes the set of strings the token can match. Ex: for Pi, the lexeme is Pi, the token is <id1>, and the pattern is letter (letter | digit)*.

3. Applications of Compiler Technology: Implementation of high-level programming languages; Optimizations for computer architectures; Design of new computer architectures; Program translations; Software productivity tools.

3.1 Implementation of High-Level Programming Languages: A high-level programming language defines a programming abstraction: the programmer expresses an algorithm using the language, and the compiler must translate that program to the target language. Higher-level programming languages are easier to program in, but are less efficient: the target programs run more slowly. Low-level languages give more control over a computation and can, in principle, produce more efficient code, but lower-level programs are harder to write and, worse still, less portable, more prone to errors, and harder to maintain. Optimizing compilers include techniques to improve the performance of generated code, thus offsetting the inefficiency introduced by high-level abstractions.

C (1970), C++ (1985), Java (1995); data-flow optimizations were introduced along the way. Object orientation was first introduced in Simula in 1967, and has been incorporated in languages such as Smalltalk, C++, C# and Java. The key ideas behind object orientation are data abstraction and inheritance of properties; these languages are easy to program in because of their reusability and modularity, and they open new opportunities for compiler optimization.

3.2 Optimizations for Computer Architectures: The rapid evolution of computer architectures has also led to demand for new compiler technology. Almost all high-performance systems take advantage of the same two basic techniques: parallelism and memory hierarchies. Parallelism can be found at several levels: at the instruction level, where multiple operations are executed simultaneously, and at the processor level, where different threads of the same application are run on different processors. Memory hierarchies are a response to the basic limitation that we can build very fast storage or very large storage, but not storage that is both fast and large (secondary memory, primary memory, cache memory).

3.3 Design of New Computer Architectures: In the early days of computer architecture design, compilers were developed after the machines were built. That has changed. Since programming in high-level languages is the norm, the performance of a computer system is determined not only by its raw speed but also by how well compilers can exploit its features. Thus, in modern computer architecture development, compilers are developed in the processor-design stage, and compiled code, running on simulators, is used to evaluate the proposed architectural features. RISC (reduced instruction set computer) vs. CISC (e.g. x86): most general-purpose processor architectures, including PowerPC, SPARC, MIPS, Alpha and PA-RISC, are based on the RISC concept.

3.4 Program Translations: While we normally think of compiling as a translation from a high-level language to the machine level, the same technology can be applied to translate between different kinds of languages. The following are some of the important applications of program-translation techniques: binary translation, hardware synthesis, database query interpreters, compiled simulation.

Binary translation: Compiler technology can be used to translate the binary code for one machine to that of another, allowing a machine to run programs originally compiled for another instruction set. Binary translation technology has been used by various computer companies to increase the availability of software for their machines. The Transmeta Crusoe processor is a VLIW processor that relies on binary translation to convert x86 code into native VLIW code. Binary translation can also be used to provide backward compatibility: when the processor in the Apple Macintosh was changed from the Motorola MC 68040 to the PowerPC in 1994, binary translation was used to allow PowerPC processors to run legacy MC 68040 code.

Hardware    Synthesis Hardware designs are mostly described in high-level hardware description languages like Verilog and VHDL (Very high-speed integrated circuit Hardware Description Language). Hardware designs are typically described at the register transfer level (RTL), where variables represent registers and expressions represent combinational logic . Hardware-synthesis tools translate RTL descriptions automatically into gates, which are then mapped to transistors and eventually to a physical layout.

Database query interpreters: Query languages, especially SQL (Structured Query Language), are used to search databases. Database queries consist of predicates containing relational and boolean operators. They can be interpreted or compiled into commands that search a database for records satisfying the predicate.

Compiled simulation: Simulation is a general technique used in many scientific and engineering disciplines to understand a phenomenon or to validate a design. Inputs to a simulator usually include the description of the design and specific input parameters for that particular simulation run. Simulations can be very expensive: each experiment may take days to complete on a high-performance machine. Instead of writing a simulator that interprets the design, it is faster to compile the design to produce machine code that simulates that particular design natively. Compiled simulation can run orders of magnitude faster than an interpreter-based approach, and it is used in many state-of-the-art tools that simulate designs written in Verilog or VHDL.

3.5 Software Productivity Tools: Errors are rampant in programs; errors may crash a system, produce wrong results, render a system vulnerable to security attacks, or even lead to catastrophic failures in critical systems. Testing is the primary technique for locating errors in programs. An interesting and promising complementary approach is to use data-flow analysis to locate errors statically. The problem of finding all program errors is undecidable, but a data-flow analysis can be designed to warn programmers of all possible statements violating a particular category of errors.

Such tools include: type checking (e.g. a = b + c), bounds checking (array out of bounds, buffer overflow, e.g. indexing past a[50]), and memory-management tools (detecting garbage and memory leaks).

4. The Evolution of Programming Languages: The first electronic computers appeared in the 1940's and were programmed in machine language, by sequences of 0's and 1's that explicitly told the computer what operations to execute and in what order. The operations themselves were very low level: move data from one location to another, add the contents of two registers, compare two values, and so on. This kind of programming was slow, tedious and error prone, and once written, the programs were hard to understand and modify.

4.1 The Move to Higher-Level Languages: Mnemonic assembly languages appeared in the early 1950's. Initially, the instructions in an assembly language were just mnemonic representations of machine instructions. Later, macro instructions were added to assembly languages so that a programmer could define parameterized shorthands for frequently used sequences of machine instructions.

A major step towards higher-level languages was made in the latter half of the 1950's with the development of Fortran for scientific computation, Cobol for business data processing, and Lisp for symbolic computation. Programmers could more easily write numerical computations, business applications and symbolic programs. These languages were so successful that they are still in use today.

Today, there are thousands of programming languages. They can be classified in a variety of ways. One classification is by generation. First-generation languages are the machine languages, second-generation the assembly languages, and third-generation the higher-level languages like Fortran, Cobol, Lisp, C, C++, C# and Java. Fourth-generation languages are languages designed for specific applications, like NOMAD for report generation, SQL for database queries, and Postscript for text formatting. The term fifth-generation language has been applied to logic- and constraint-based languages like Prolog and OPS5.

Another classification of languages uses the term imperative for languages in which a program specifies how a computation is to be done, and declarative for languages in which a program specifies what computation is to be done. Languages such as C, C++, C# and Java are imperative languages; in imperative languages there is a notion of program state and statements that change the state. Functional languages such as ML and Haskell, and constraint logic languages such as Prolog, are often considered to be declarative languages.

An object-oriented language is one that supports object-oriented programming, a programming style in which a program consists of a collection of objects that interact with one another. Simula 67 and Smalltalk are the earliest major object-oriented languages. Languages such as C++, C#, Java and Ruby are more recent object-oriented languages. Scripting languages are interpreted languages with high-level operators designed for "gluing together" computations; these computations were originally called "scripts". Awk, JavaScript, Perl, PHP, Python, Ruby and Tcl are popular examples of scripting languages. Programs written in scripting languages are often much shorter than equivalent programs written in languages like C.

5. The Science of Building a Compiler: A compiler must accept all source programs that conform to the specification of the language; the set of source programs is infinite, and any program can be very large, consisting of possibly millions of lines of code. Any transformation performed by the compiler while translating a source program must preserve the meaning of the program being compiled. These requirements make compiler development challenging.

5.1 Modelling in Compiler Design and Implementation: The study of compilers is mainly a study of how we design the right mathematical models and choose the right algorithms, while balancing the need for generality and power against simplicity and efficiency. The fundamental models are finite-state machines and regular expressions. These models are useful for describing the lexical units of programs (keywords, identifiers, and such) and for describing the algorithms used by the compiler to recognize those units. Other models, such as context-free grammars, are used to describe the syntactic structure of programming languages, e.g. the nesting of parentheses or control constructs. Similarly, trees are an important model for representing the structure of programs and their translation into object code.

5.2 The Science of Code Optimization: The term "optimization" in compiler design refers to the attempts that a compiler makes to produce code that is more efficient than the obvious code. Compiler optimizations must meet the following design objectives: the optimization must be correct, that is, preserve the meaning of the compiled program; the optimization must improve the performance of many programs; the compilation time must be kept reasonable; and the engineering effort required must be manageable. The second goal means the compiler must be effective in improving the performance of many input programs. Normally, performance means the speed of program execution; other factors matter too, such as code size (embedded systems) and power consumption (mobile devices).

Third, we need to keep the compilation time short to support a rapid development and debugging cycle. Finally, a compiler is a complex system; we must keep it simple to ensure that the engineering and maintenance costs of the compiler are manageable. There is an infinite number of program optimizations that we could implement, and it takes a nontrivial amount of effort to create a correct and effective optimization. We must prioritize the optimizations, implementing only those that lead to the greatest benefits on source programs encountered in practice.

Lexical Analysis: the role of the lexical analyzer, input buffering, specification of tokens, recognition of tokens, the lexical-analyzer generator Lex.

6. The Role of the Lexical Analyzer: The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis. The lexical analyzer also interacts with the symbol table: when it discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must pass to the parser.

The parser calls the lexical analyzer with a getNextToken command, which causes the lexical analyzer to read characters from its input until it can identify the next lexeme; it then produces a token for it and returns to the parser. Besides scanning and generating tokens, the lexical analyzer performs certain other tasks: stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input), and correlating error messages generated by the compiler with the source program. For instance, the lexical analyzer may keep track of the number of newline characters seen, so that it can associate a line number with each error message. In some compilers, the lexical analyzer makes a copy of the source program with the error messages inserted at the appropriate positions. A small sketch of these housekeeping tasks follows.
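
A small C sketch (mine, not from the slides) of these housekeeping tasks: it skips blanks, tabs and // comments while counting newlines, so that later error messages can carry a line number. The helper name is illustrative.

#include <stdio.h>

static int line_no = 1;   /* used to tag error messages with a line number */

/* Return the first character of the next lexeme (or EOF), skipping
   whitespace and single-line comments and counting newlines. */
static int next_lexeme_start(FILE *in)
{
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (c == '\n') {
            line_no++;
        } else if (c == ' ' || c == '\t') {
            /* whitespace: ignore */
        } else if (c == '/') {
            int d = fgetc(in);
            if (d == '/') {                       /* a // comment: discard to end of line */
                while ((d = fgetc(in)) != EOF && d != '\n')
                    ;
                if (d == '\n') line_no++;
            } else {
                if (d != EOF) ungetc(d, in);      /* one-character pushback is enough */
                return c;                         /* '/' itself starts the next lexeme */
            }
        } else {
            return c;                             /* first character of the next lexeme */
        }
    }
    return EOF;
}

int main(void)
{
    int c;
    while ((c = next_lexeme_start(stdin)) != EOF)
        printf("line %d: next lexeme starts with '%c'\n", line_no, c);
    return 0;
}

A real scanner would go on to read the rest of each lexeme; here every remaining character is reported individually, which is enough to show the line counting.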

Example input fragment with errors for the compiler to report: # inclue <\   int abc   A = b + c

The Role of the Lexical Analyzer contd.: Lexical Analysis Versus Parsing; Tokens, Patterns, and Lexemes; Attributes for Tokens; Lexical Errors.

6.1 Lexical Analysis Versus Parsing: There are a number of reasons why lexical analysis and parsing are kept separate. 1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks; if we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design. 2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing; in addition, specialized buffering techniques for reading input characters can speed up the compiler significantly. 3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

6.2 Tokens, Patterns, and Lexemes: Token: a logically cohesive sequence of characters, e.g. keywords, operators, identifiers, special symbols. Lexeme: the group of characters forming a token. Pattern: a rule associated with a token that describes the set of strings the token can match. Example: num1 = b + c is tokenized as <id,1> = <id,2> + <id,3>; 1awe = a + b is invalid, since 1awe does not match the identifier pattern (an identifier cannot begin with a digit).

6.3 Attributes for Tokens: When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched. For example, the pattern for token number matches both 0 and 1, but it is extremely important for the code generator to know which lexeme was found in the source program. Thus, in many cases the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token; the token name influences parsing decisions, while the attribute value influences translation of tokens after the parse. Normally, information about an identifier, e.g. its lexeme, its type, and the location at which it is first found (in case an error message about that identifier must be issued), is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.

6.4 Lexical Errors: Consider fi ( a == f(x) ) ... A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler, probably the parser in this case, handle the error due to the transposition of the letters. However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is "panic mode" recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left. This recovery technique may confuse the parser, but in an interactive computing environment it may be quite adequate.

Other possible error-recovery actions are: Delete one character from the remaining input. Insert a missing character into the remaining input.  Replace a character by another character.  Transpose two adjacent characters.  Transformations like these may be tried in an attempt to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single transformation. This strategy makes sense, since in practice most lexical errors involve a single character. A more general correction strategy is to find the smallest number of transformations needed to convert the source program into one that consists only of valid lexemes , but this approach is considered too expensive in practice to be worth the effort.

7. Input Buffering. 7.1 Buffer Pairs: An important scheme involves two buffers (or one buffer divided into two halves) that are alternately reloaded, as suggested in Fig. 3.3. Each buffer is of the same size N, and N is usually the size of a disk block, e.g. 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file.

Two pointers to the input are maintained: pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine, and pointer forward scans ahead until a pattern match is found. Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. In Fig. 3.3, we see forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted one position to its left. Advancing forward requires that we first test whether we have reached the end of one of the buffers; if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.

(Figure: an input buffer split into two N-byte halves; source: https://www.slideshare.net/dattatraygandhmal/input-buffering)

(Example: for the input Abc = pqr * xyz, lexemeBegin points at A while forward scans ahead; when forward passes the end of Abc, the lexeme Abc is returned as an id token, then = as an operator, and scanning continues with pqr, *, xyz.)

2. Sentinels We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a  sentinel  character at the end. The sentinel is a special character that  cannot be part of the source program, and a natural choice is the character  eof . Figure  3.4  shows the same arrangement as Fig.  3.3,  but with the sentinels added. Note that  eof  retains its use as a marker for the end of the entire input. Any  eof  that appears other than at the end of a buffer means that the input is at an end.
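
A sketch in C of the sentinel scheme described above, following the textbook's algorithm. The buffer names, the choice of '\0' as the eof sentinel and the reload helper are my assumptions for illustration.

#include <stdio.h>

#define N 4096
#define EOF_CHAR '\0'                    /* sentinel, assumed absent from source text */

static char buf1[N + 1], buf2[N + 1];    /* two halves, each ending with a sentinel   */
static char *forward = buf1;

/* Fill one half from the input and place the sentinel after the last character. */
static char *reload(char *half)
{
    size_t got = fread(half, 1, N, stdin);
    half[got] = EOF_CHAR;
    return half;
}

/* Advance forward by one character, combining the end-of-buffer test with
   the test for the current character. */
static char advance(void)
{
    char c = *forward++;
    if (c == EOF_CHAR) {
        if (forward - 1 == buf1 + N) {        /* sentinel at end of first half  */
            forward = reload(buf2);           /* reload second half             */
            c = *forward++;
        } else if (forward - 1 == buf2 + N) { /* sentinel at end of second half */
            forward = reload(buf1);           /* reload first half              */
            c = *forward++;
        } else {
            return EOF_CHAR;                  /* eof inside a buffer: end of input */
        }
    }
    return c;
}

int main(void)
{
    char c;
    forward = reload(buf1);                   /* prime the first half */
    while ((c = advance()) != EOF_CHAR)
        putchar(c);
    return 0;
}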

8.Specification of Tokens 1. Strings and Languages 2. Operations on Languages 3. Regular Expressions 4. Regular Definitions

1. Strings and Languages: An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits and punctuation. The set {0,1} is the binary alphabet. The following string-related terms are commonly used: A prefix of string s is any string obtained by removing zero or more symbols from the end of s; for example, ban, banana, and ε are prefixes of banana. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s; for example, nana, banana, and ε are suffixes of banana. A substring of s is obtained by deleting any prefix and any suffix from s; for instance, banana, nan, and ε are substrings of banana.

The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are neither ε nor s itself. A subsequence of s is any string formed by deleting zero or more not-necessarily-consecutive positions of s; for example, baan is a subsequence of banana. If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed by appending y to x; for example, if x = dog and y = house, then xy = doghouse. The empty string is the identity under concatenation; that is, for any string s, εs = sε = s.

2. Operations on Languages: The most important operations on languages are union, concatenation, and closure. Union is the familiar operation on sets. The concatenation of languages is the set of all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them. The (Kleene) closure of a language L, denoted L*, is the set of strings obtained by concatenating L zero or more times. Finally, the positive closure, denoted L+, is the same as the Kleene closure but without the term L^0; that is, ε will not be in L+ unless it is in L itself.

Example 3.3: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. We may think of L and D in two essentially equivalent ways. L ∪ D is the language with 62 strings of length one, each of which is either one letter or one digit. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit. L4 is the set of all 4-letter strings. L* is the set of all strings of letters, including ε, the empty string. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter. D+ is the set of all strings of one or more digits.

3. Regular Expressions: The languages accepted by finite automata can be easily described by simple expressions called regular expressions, and the languages described by regular expressions are referred to as regular languages. A regular expression can also be described as a sequence of patterns that defines a string. Regular expressions are used to match character combinations in strings; string-searching algorithms use these patterns to find and operate on matches within a string.

INDUCTION: There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively. 1. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s). 2. (r)(s) is a regular expression denoting the language L(r)L(s). 3. (r)* is a regular expression denoting (L(r))*. 4. (r) is a regular expression denoting L(r); that is, extra parentheses around an expression do not change the language it denotes.

4. Regular Definitions: For notational convenience, we may wish to give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form d1 -> r1, d2 -> r2, ..., dn -> rn, where each di is a new symbol, not in Σ and not the same as any other of the d's, and each ri is a regular expression over Σ and the previously defined d1, ..., di-1. Ex: d1 -> a, d2 -> jk, d3 -> a*.

C identifiers are strings of letters, digits, and underscores. Here is a regular definition for the language of C identifiers (we conventionally use italics for the symbols defined in regular definitions):
letter -> A | B | ... | Z | a | b | ... | z
digit -> 0 | 1 | 2 | ... | 9
id -> letter ( letter | digit | _ )*
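
For experimentation, the identifier pattern can be tried with the POSIX regex API in C; this is a sketch under the assumption (matching the slide's definition) that an identifier must begin with a letter.

#include <regex.h>
#include <stdio.h>

/* Return 1 if s matches the pattern letter ( letter | digit | _ )*. */
static int is_identifier(const char *s)
{
    regex_t re;
    int ok;
    /* ^ and $ anchor the pattern so the whole string must match */
    if (regcomp(&re, "^[A-Za-z][A-Za-z0-9_]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void)
{
    printf("%d %d %d\n",
           is_identifier("num1"),      /* 1 */
           is_identifier("1awe"),      /* 0: starts with a digit */
           is_identifier("rate_60"));  /* 1 */
    return 0;
}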

Write the regular definition for representing numbers: number -> [0-9]+. Write the regular definition for representing floating-point numbers such as 95.89 and .987:
digit -> [0-9]
number -> digit+
fnumber -> number . number
(.987 is then written as 0.987)

Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular definition:
digit -> 0 | 1 | 2 | 3 | ... | 9
number -> digit digit*
optfrac -> . number | ε
optexp -> E ( + | - | ε ) number | ε
unum -> number optfrac optexp

Recognition of Tokens: Transition diagrams have a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input while looking for a lexeme that matches one of several patterns. States are classified as the initial state, final (accepting) states, and non-final states. Edges are directed from one state of the transition diagram to another; each edge is labelled by a symbol or set of symbols.

(Transition diagram for relational operators: from the initial state, the inputs <, <=, <>, >, >=, and = lead to final states that return the corresponding relop token; for example, in a<b the lexeme < is recognized between the two identifiers. A C sketch of this diagram follows.)
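
The relational-operator diagram translates directly into code: each state becomes a branch and each edge becomes a character test. The sketch below is illustrative (the token names are my own); it recognizes <, <=, <>, >, >=, and =, retracting when only the one-character operator belongs to the lexeme.

#include <stdio.h>

enum Relop { LT, LE, NE, GT, GE, EQ, NOT_RELOP };   /* assumed token names */

/* Recognize a relational operator at the front of s, following the
   transition diagram; *len receives the number of characters consumed. */
static enum Relop get_relop(const char *s, int *len)
{
    switch (s[0]) {                                  /* the start state */
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }    /* edge '=' to a final state */
        if (s[1] == '>') { *len = 2; return NE; }    /* edge '>' to a final state */
        *len = 1; return LT;                         /* retract: lexeme is just '<' */
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1; return GT;
    case '=':
        *len = 1; return EQ;
    default:
        *len = 0; return NOT_RELOP;
    }
}

int main(void)
{
    int len;
    enum Relop r = get_relop("<=b", &len);   /* as in a<=b, after 'a' has been scanned */
    printf("token %d, lexeme length %d\n", (int)r, len);
    return 0;
}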

The Lexical-Analyzer Generator Lex: Use of Lex; Structure of Lex Programs; Conflict Resolution in Lex; The Lookahead Operator.

1. Use of  Lex

Structure of Lex Programs: A Lex program is separated into three sections by %% delimiters. The format of a Lex source file is as follows:
{ definitions }
%%
{ rules }
%%
{ user subroutines }

Definitions include declarations of constants, variables and regular definitions. Rules are statements of the form p1 {action1} p2 {action2} ... pn {actionN}, where each pi is a regular expression and each actioni describes what the lexical analyzer should do when pattern pi matches a lexeme. User subroutines are auxiliary procedures needed by the actions; the subroutines can be compiled separately and loaded with the lexical analyzer.

Conflict Resolution in Lex: Lex uses two rules to decide on the proper lexeme to select when several prefixes of the input match one or more patterns. Rule 1: Always prefer a longer prefix to a shorter prefix. Ex: for the input >= b+c, the lexeme >= (token GE) is preferred over > alone. Rule 2: If the longest possible prefix matches two or more patterns, prefer the pattern listed first in the Lex program. Ex: the lexeme if matches both the keyword pattern and the identifier pattern; the pattern listed first wins.

4. The Lookahead Operator: Lex automatically reads one character ahead of the last character that forms the selected lexeme, and then retracts the input so that only the lexeme itself is consumed from the input. However, sometimes we want a certain pattern to be matched to the input only when it is followed by certain other characters. In that case, we may use the slash in a pattern to indicate the end of the part of the pattern that matches the lexeme; the remainder of the pattern is lookahead and is not part of the lexeme.

(Example: inputs such as Rthjj and 6878jjnkjk each match the pattern [A-Za-z0-9]* in the program below, so the action i++ counts them as words.)

/* Lex program to count the number of words */
%{
#include <stdio.h>
#include <string.h>
int i = 0;
%}
/* Rules section */
%%
([a-zA-Z0-9])* { i++; }                    /* rule for counting the number of words */
"\n"           { printf("%d\n", i); i = 0; }
%%

int yywrap(void) { return 1; }
int main()
{
    /* the function that starts the analysis */
    yylex();
    return 0;
}
Function yywrap is called by lex when input is exhausted; return 1 if you are done, or 0 if more processing is required. Every C program requires a main function; here we simply call yylex, which is the main entry point for lex.

/* Lex program to identify and count positive and negative numbers */
%{
int positive_no = 0, negative_no = 0;
%}
%%
^[-][0-9]+ { negative_no++;
             printf("negative number = %s\n", yytext); }
[0-9]+     { positive_no++;
             printf("positive number = %s\n", yytext); }
%%

/*** user code section ***/
int yywrap() { return 1; }
int main()
{
    yylex();
    printf("number of positive numbers = %d, "
           "number of negative numbers = %d\n",
           positive_no, negative_no);
    return 0;
}
yytext holds the text matched by the current rule's pattern.

Regular expression for any string starting with a, where Σ = {a, b}: a(a|b)*. Regular expression for any string ending with a, where Σ = {a, b}: (a|b)*a. Regular expression for any string beginning with a and ending with b, where Σ = {a, b}: a(a|b)*b.

All strings of lowercase letters: [a-z]*. All strings of lowercase letters in which the letters are in ascending lexicographic order: a*b*c*d*...z*. All strings of a's and b's with an even number of a's: (b*ab*ab*)*b*.

All strings of lowercase letters that contain the five vowels in order: letter -> [b-d f-h j-n p-t v-z] (the consonants); string -> letter* a (letter|a)* e (letter|e)* i (letter|i)* o (letter|o)* u (letter|u)*. Describe the languages denoted by the following regular expressions: (i) (a|b)*a(a|b)(a|b) (ii) a*ba*ba*ba*.

With a suitable transition diagram, explain recognition of keywords and identifiers.