Microcontroller and Embedded Systems
Module-3
C Compilers and Optimization: Structure Arrangement, Bit-fields, Unaligned Data and Endianness, Division, Floating Point, Inline Functions and Inline Assembly, Portability Issues.
ARM programming using Assembly language: Writing Assembly code, Profiling and cycle counting, Instruction scheduling, Register Allocation, Conditional Execution, Looping Constructs
Laboratory Component:
1. Write a program to arrange a series of 32-bit numbers in ascending/descending order.
2. Write a program to count the number of ones and zeros in two consecutive memory locations.
3. Display “Hello World” message using Internal UART.
28-07-2023 Dr Anitha D B, CSE-DS, ATMECE, Mysore
C Compilers and Optimization
STRUCTURE ARRANGEMENT
• The way you lay out a frequently used structure can have a significant impact on its performance and code density.
• There are two issues concerning structures on the ARM: alignment of the structure entries and the overall size of the structure.
• ARM compilers will automatically align the start address of a structure to a multiple of the largest access width used within the structure (usually four or eight bytes) and align entries within structures to their access width by inserting padding.
For example, consider the structure
    struct {
        char a;
        int b;
        char c;
        short d;
    }
For a little-endian memory system the compiler will lay this out adding padding to ensure that the next object is aligned to the size of that object: three bytes of padding follow a and one byte of padding follows c, for a total of 12 bytes.
• To improve the memory usage, you should reorder the elements:
    struct {
        char a;
        char c;
        short d;
        int b;
    }
• This reduces the structure size from 12 bytes to 8 bytes: a, c, and d pack into the first four bytes and b occupies the next four, with no padding.
• Therefore, it is a good idea to group structure elements of the same size, so that the structure layout doesn’t contain unnecessary padding.
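The effect of reordering can be checked with sizeof and offsetof. The following is a minimal sketch, assuming a typical ABI with a 4-byte int and a 2-byte short; the struct tags unordered and ordered and the function print_layouts are illustrative names that do not appear in the slides.

```c
#include <stddef.h>
#include <stdio.h>

/* Original ordering from the text: padding follows a and c. */
struct unordered { char a; int b; char c; short d; };

/* Reordered so that the smallest elements come first. */
struct ordered { char a; char c; short d; int b; };

/* Prints the size of each structure and the offset of every member. */
void print_layouts(void)
{
    printf("unordered: size=%lu a@%lu b@%lu c@%lu d@%lu\n",
           (unsigned long)sizeof(struct unordered),
           (unsigned long)offsetof(struct unordered, a),
           (unsigned long)offsetof(struct unordered, b),
           (unsigned long)offsetof(struct unordered, c),
           (unsigned long)offsetof(struct unordered, d));
    printf("ordered:   size=%lu a@%lu c@%lu d@%lu b@%lu\n",
           (unsigned long)sizeof(struct ordered),
           (unsigned long)offsetof(struct ordered, a),
           (unsigned long)offsetof(struct ordered, c),
           (unsigned long)offsetof(struct ordered, d),
           (unsigned long)offsetof(struct ordered, b));
}
```

On a compiler that follows the alignment rules described above, the two sizes should come out as 12 and 8 bytes. Other ABIs may differ, which is why it is worth printing the layout rather than assuming it.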
• The armcc compiler does include a keyword __packed that removes all padding.
• For example, the structure
    __packed struct {
        char a;
        int b;
        char c;
        short d;
    }
  will be laid out in memory with no padding between its elements.
• However, packed structures are slow and inefficient to access. The compiler emulates unaligned load and store operations by using several aligned accesses with data operations to merge the results.
• Only use the __packed keyword where space is far more important than speed and you can’t reduce padding by rearrangement. Also use it for porting code that assumes a certain structure layout in memory.
The exact layout of a structure in memory may depend on the compiler vendor and compiler version you use. In API (Application Programmer Interface) definitions it is often a good idea to insert any padding that you cannot get rid of into the structure manually. This way the structure layout is not ambiguous. It is easier to link code between compiler versions and compiler vendors if you stick to unambiguous structures.
Another point of ambiguity is enum. Different compilers use different sizes for an enumerated type, depending on the range of the enumeration. For example, consider the type
    typedef enum { FALSE, TRUE } Bool;
• The armcc in ADS1.1 will treat Bool as a one-byte type as it only uses the values 0 and 1. Bool will only take up 8 bits of space in a structure.
• However, gcc will treat Bool as a word and take up 32 bits of space in a structure.
• To avoid ambiguity it is best to avoid using enum types in structures used in the API to your code.
Another consideration is the size of the structure and the offsets of elements within the structure. This problem is most acute when you are compiling for the Thumb instruction set. Thumb instructions are only 16 bits wide and so only allow for small element offsets from a structure base pointer. The table shows the load and store base register offsets available in Thumb.
• Therefore the compiler can only access an 8-bit structure element with a single instruction if it appears within the first 32 bytes of the structure. Similarly, single instructions can only access 16-bit values in the first 64 bytes and 32-bit values in the first 128 bytes. Once you exceed these limits, structure accesses become inefficient.
• The following rules generate a structure with the elements packed for maximum efficiency:
  - Place all 8-bit elements at the start of the structure.
  - Place all 16-bit elements next, then 32-bit, then 64-bit.
  - Place all arrays and larger elements at the end of the structure.
  - If the structure is too big for a single instruction to access all the elements, then group the elements into substructures. The compiler can maintain pointers to the individual substructures.
SUMMARY Efficient Structure Arrangement
• Lay structures out in order of increasing element size. Start the structure with the smallest elements and finish with the largest.
• Avoid very large structures. Instead use a hierarchy of smaller structures.
• For portability, manually add padding (that would appear implicitly) into API structures so that the layout of the structure does not depend on the compiler.
• Beware of using enum types in API structures. The size of an enum type is compiler dependent.
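The advice on manual padding can be sketched as follows. This is an illustration only: ApiRecord, pad0, and pad1 are hypothetical names, and the layout assumes a 4-byte int and a 2-byte short. The typedef lines are a compile-time check (an array of negative size fails to compile) that the explicit padding matches what the compiler would otherwise insert.

```c
#include <stddef.h>

/* Hypothetical API structure with the implicit padding written out by hand. */
typedef struct {
    char  a;
    char  pad0[3];   /* explicit padding so that b sits at offset 4 */
    int   b;
    char  c;
    char  pad1;      /* explicit padding so that d sits at offset 10 */
    short d;
} ApiRecord;

/* Compile-time layout checks: the array size is -1 if a condition is false. */
typedef char ApiRecord_size_ok[(sizeof(ApiRecord) == 12) ? 1 : -1];
typedef char ApiRecord_b_ok[(offsetof(ApiRecord, b) == 4) ? 1 : -1];
typedef char ApiRecord_d_ok[(offsetof(ApiRecord, d) == 10) ? 1 : -1];
```

If a different compiler lays the structure out differently, the build fails at the check instead of silently producing incompatible binaries.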
BIT-FIELDS
• Bit-fields are probably the least standardized part of the ANSI C specification. The compiler can choose how bits are allocated within the bit-field container. For this reason alone, avoid using bit-fields inside a union or in an API structure definition. Different compilers can assign the same bit-field different bit positions in the container.
• It is also a good idea to avoid bit-fields for efficiency. Bit-fields are structure elements and usually accessed using structure pointers; consequently, they suffer from the pointer aliasing problems. Every bit-field access is really a memory access. Possible pointer aliasing often forces the compiler to reload the bit-field several times.
The following example, dostages_v1, illustrates this problem. It also shows that compilers do not tend to optimize bit-field testing very well.
    void dostageA(void);
    void dostageB(void);
    void dostageC(void);

    typedef struct {
        unsigned int stageA : 1;
        unsigned int stageB : 1;
        unsigned int stageC : 1;
    } Stages_v1;

    void dostages_v1(Stages_v1 *stages)
    {
        if (stages->stageA)
        {
            dostageA();
        }
        if (stages->stageB)
        {
            dostageB();
        }
        if (stages->stageC)
        {
            dostageC();
        }
    }
Here, we use three bit-field flags to enable three possible stages of processing. In the compiled output:
• Note that the compiler accesses the memory location containing the bit-field three times. Because the bit-field is stored in memory, the dostage functions could change the value. Also, the compiler uses two instructions to test bit 1 and bit 2 of the bit-field, rather than a single instruction.
• You can generate far more efficient code by using an integer rather than a bit-field.
• Use enum or #define masks to divide the integer type into different fields.
• The following code implements the dostages function using logical operations rather than bit-fields:
    typedef unsigned long Stages_v2;
    #define STAGEA (1ul << 0)
    #define STAGEB (1ul << 1)
    #define STAGEC (1ul << 2)

    void dostages_v2(Stages_v2 *stages_v2)
    {
        Stages_v2 stages = *stages_v2;

        if (stages & STAGEA)
        {
            dostageA();
        }
        if (stages & STAGEB)
        {
            dostageB();
        }
        if (stages & STAGEC)
        {
            dostageC();
        }
    }
• Now that a single unsigned long type contains all the bit-fields, we can keep a copy of their values in a single local variable stages, which removes the memory aliasing problem.
• In other words, the compiler must assume that the dostageX (where X is A, B, or C) functions could change the value of *stages_v2, whereas the local copy stages is read from memory only once.
• The compiled code for this version gives a saving of 33% over the previous version using ANSI bit-fields.
• You can also use the masks to set and clear the bit-fields, just as easily as for testing them. The following code shows how to set, clear, or toggle bits using the STAGE masks:
    stages |= STAGEA;    /* enable stage A */
    stages &= ~STAGEB;   /* disable stage B */
    stages ^= STAGEC;    /* toggle stage C */
• These bit set, clear, and toggle operations take only one ARM instruction each, using ORR, BIC, and EOR instructions, respectively. Another advantage is that you can now manipulate several bit-fields at the same time, using one instruction.
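Manipulating several fields at once amounts to combining the masks with OR before applying the operation. The sketch below reuses the STAGE masks from the text; the helper function names are illustrative and not part of the original material.

```c
typedef unsigned long Stages_v2;

#define STAGEA (1ul << 0)
#define STAGEB (1ul << 1)
#define STAGEC (1ul << 2)

/* Enable stages A and B with a single OR. */
Stages_v2 enable_a_and_b(Stages_v2 stages)
{
    return stages | (STAGEA | STAGEB);
}

/* Disable stages A and C with a single AND of the inverted mask. */
Stages_v2 disable_a_and_c(Stages_v2 stages)
{
    return stages & ~(STAGEA | STAGEC);
}

/* Toggle stages B and C with a single exclusive OR. */
Stages_v2 toggle_b_and_c(Stages_v2 stages)
{
    return stages ^ (STAGEB | STAGEC);
}
```

Because each combined mask is a compile-time constant, each helper needs only one logical operation on the stages value.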
SUMMARY Bit-fields
• Avoid using bit-fields. Instead use #define or enum to define mask values.
• Test, toggle, and set bit-fields using integer logical AND, OR, and exclusive OR operations with the mask values. These operations compile efficiently, and you can test, toggle, or set multiple fields at the same time.
UNALIGNED DATA AND ENDIANNESS
• Unaligned data and endianness are two issues that can complicate memory accesses and portability. Is the array pointer aligned? Is the ARM configured for a big-endian or little-endian memory system?
• The ARM load and store instructions assume that the address is a multiple of the size of the type you are loading or storing. If you load or store to an address that is not aligned to its type, then the behavior depends on the particular implementation. The core may generate a data abort or load a rotated value. For well-written, portable code you should avoid unaligned accesses.
• C compilers assume that a pointer is aligned unless you say otherwise. If a pointer isn’t aligned, then the program may give unexpected results. This is sometimes an issue when you are porting code to the ARM from processors that do allow unaligned accesses. For armcc, the __packed directive tells the compiler that a data item can be positioned at any byte alignment. This is useful for porting code, but using __packed will impact performance.
To illustrate this, look at the following simple routine, readint. It returns the integer at the address pointed to by data. We’ve used __packed to tell the compiler that the integer may possibly not be aligned.
    int readint(__packed int *data)
    {
        return *data;
    }
This compiles to
    readint
        BIC   r3,r0,#3          ; r3 = data & 0xFFFFFFFC
        AND   r0,r0,#3          ; r0 = data & 0x00000003
        MOV   r0,r0,LSL #3      ; r0 = bit offset of data word
        LDMIA r3,{r3,r12}       ; r3, r12 = 8 bytes read from r3
        MOV   r3,r3,LSR r0      ; These three instructions
        RSB   r0,r0,#0x20       ; shift the 64 bit value r12.r3
        ORR   r0,r3,r12,LSL r0  ; right by r0 bits
        MOV   pc,r14            ; return r0
• The code is large and complex.
• The compiler emulates the unaligned access using two aligned accesses and data processing operations, which is very costly.
• You should avoid __packed.
• Instead use the type char * to point to data that can appear at any alignment.
• Alignment problems commonly arise when reading data packets or files used to transfer information between computers. Network packets and compressed image files are good examples. Two- or four-byte integers may appear at arbitrary offsets in these files. Data has been squeezed as much as possible, to the detriment of alignment.
• Endianness (or byte order) is also a big issue when reading data packets or compressed files. The ARM core can be configured to work in little-endian (least significant byte at lowest address) or big-endian (most significant byte at lowest address) modes. Little-endian mode is usually the default.
• The endianness of an ARM is usually set at power-up and remains fixed thereafter. Tables illustrate how the ARM’s 8-bit, 16-bit, and 32-bit load and store instructions work for different endian configurations.
We assume that byte address A is aligned to the size of the memory transfer. The tables show how the byte addresses in memory map into the 32-bit register that the instruction loads or stores.
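The mapping between byte addresses and register contents can be observed from C by storing a known 32-bit word and inspecting the byte at its lowest address. This is a small sketch with illustrative function names (lowest_address_byte and is_little_endian are not from the slides); it reports the byte order of whatever system it runs on.

```c
#include <stdint.h>
#include <string.h>

/* Returns the byte stored at the lowest address of a 32-bit word. */
unsigned char lowest_address_byte(uint32_t word)
{
    unsigned char bytes[sizeof word];

    memcpy(bytes, &word, sizeof word);   /* copy out the in-memory representation */
    return bytes[0];
}

/* Returns 1 on a little-endian memory system and 0 otherwise. */
int is_little_endian(void)
{
    /* The least significant byte of 0x11223344 is 0x44. */
    return lowest_address_byte(0x11223344u) == 0x44;
}
```

On a little-endian configuration the lowest address holds the least significant byte (0x44); on a big-endian configuration it holds the most significant byte (0x11).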
• What is the best way to deal with endian and alignment problems?
• If speed is not critical, then use functions such as readint_little and readint_big, which read a four-byte integer from a possibly unaligned address in memory. The address alignment is not known at compile time, only at run time.
• If you’ve loaded a file containing big-endian data such as a JPEG image, then use readint_big.
• For a bytestream containing little-endian data, use readint_little.
• Both routines will work correctly regardless of the memory endianness the ARM is configured for.
These functions read a 32-bit integer from a bytestream pointed to by data. The bytestream contains little- or big-endian data, respectively. The functions are independent of the ARM memory system byte order since they only use byte accesses.
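One possible shape for these two routines is sketched below. The function names come from the text, but the signatures (const unsigned char * input, uint32_t result) are assumptions chosen so the code behaves the same whether plain char is signed or unsigned.

```c
#include <stdint.h>

/* Reads a 32-bit little-endian value: least significant byte first. */
uint32_t readint_little(const unsigned char *data)
{
    return  (uint32_t)data[0]
         | ((uint32_t)data[1] << 8)
         | ((uint32_t)data[2] << 16)
         | ((uint32_t)data[3] << 24);
}

/* Reads a 32-bit big-endian value: most significant byte first. */
uint32_t readint_big(const unsigned char *data)
{
    return ((uint32_t)data[0] << 24)
         | ((uint32_t)data[1] << 16)
         | ((uint32_t)data[2] << 8)
         |  (uint32_t)data[3];
}
```

Since only byte loads are used, the pointer may have any alignment, and the result does not depend on the endianness of the processor running the code.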
If speed is critical, then the fastest approach is to write several variants of the critical routine. For each possible alignment and ARM endianness configuration, you call a separate routine optimized for that situation.
SUMMARY Endianness and Alignment
• Avoid using unaligned data if you can.
• Use the type char * for data that can be at any byte alignment. Access the data by reading bytes and combining with logical operations. Then the code won’t depend on alignment or ARM endianness configuration.
• For fast access to unaligned structures, write different variants according to pointer alignment and processor endianness.
DIVISION
• The ARM does not have a divide instruction in hardware. Instead the compiler implements divisions by calling software routines in the C library. There are many different types of division routine that you can tailor to a specific range of numerator and denominator values.
• The standard integer division routine provided in the C library can take between 20 and 100 cycles, depending on implementation, early termination, and the ranges of the input operands.
• Division and modulus (/ and %) are such slow operations that you should avoid them as much as possible. However, division by a constant and repeated division by the same denominator can be handled efficiently.
• This section describes how to replace certain divisions by multiplications and how to minimize the number of division calls.
• Circular buffers are one area where programmers often use division, but you can avoid these divisions completely. Suppose you have a circular buffer of size buffer_size bytes and a position indicated by a buffer offset. To advance the offset by increment bytes you could write
    offset = (offset + increment) % buffer_size;
Instead it is far more efficient to write
    offset += increment;
    if (offset >= buffer_size)
    {
        offset -= buffer_size;
    }
• The first version may take 50 cycles; the second will take 3 cycles because it does not involve a division. The second version assumes that increment is less than buffer_size, so that a single subtraction is enough.
• If you can’t avoid a division, then try to arrange that the numerator and denominator are unsigned integers.
• Signed division routines are slower since they take the absolute values of the numerator and denominator and then call the unsigned division routine. They fix the sign of the result afterwards.
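The two ways of advancing a circular buffer offset can be compared side by side. In this sketch the function names advance_mod and advance_wrap are illustrative; advance_wrap assumes offset < buffer_size and increment < buffer_size, which is the condition under which a single subtraction is equivalent to the modulus.

```c
/* Advance using the modulus operator (calls a division routine on ARM). */
unsigned int advance_mod(unsigned int offset, unsigned int increment,
                         unsigned int buffer_size)
{
    return (offset + increment) % buffer_size;
}

/* Advance using compare and subtract; requires offset and increment
   to both be less than buffer_size. */
unsigned int advance_wrap(unsigned int offset, unsigned int increment,
                          unsigned int buffer_size)
{
    offset += increment;
    if (offset >= buffer_size)
    {
        offset -= buffer_size;   /* wrap around once */
    }
    return offset;
}
```

For inputs that satisfy the stated assumption, both functions return the same offset, but only the first involves a division.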
Many C library division routines return the quotient and remainder from the division. In other words a free remainder operation is available to you with each division operation and vice versa. For example, to find the (x, y) position of a location at offset bytes into a screen buffer, it is tempting to write
    typedef struct {
        int x;
        int y;
    } point;

    point getxy_v1(unsigned int offset, unsigned int bytes_per_line)
    {
        point p;

        p.y = offset / bytes_per_line;
        p.x = offset - p.y * bytes_per_line;
        return p;
    }
It appears that we have saved a division by using a subtract and multiply to calculate p.x, but in fact, it is often more efficient to write the function with the modulus or remainder operation.
Example: In getxy_v2, the quotient and remainder operation only require a single call to a division routine:
    point getxy_v2(unsigned int offset, unsigned int bytes_per_line)
    {
        point p;

        p.x = offset % bytes_per_line;
        p.y = offset / bytes_per_line;
        return p;
    }
In the compiler's assembly output, this version is four instructions shorter than getxy_v1.
DIVISION
• REPEATED UNSIGNED DIVISION WITH REMAINDER
• CONVERTING DIVIDES INTO MULTIPLIES
• UNSIGNED DIVISION BY A CONSTANT
• SIGNED DIVISION BY A CONSTANT
REPEATED UNSIGNED DIVISION WITH REMAINDER
Often the same denominator occurs several times in code. In the previous example, bytes_per_line will probably be fixed throughout the program. If we project from three to two cartesian coordinates, then we use the denominator twice:
• (x, y, z) → (x/z, y/z)
• In these situations it is more efficient to cache the value of 1/z in some way and use a multiplication by 1/z instead of a division.
CONVERTING DIVIDES INTO MULTIPLIES
Example: The routine, scale, shows how to convert divisions to multiplications in practice. It divides an array of N elements by denominator d.
• First calculate the value of s.
• Then replace each divide by d with a multiplication by s.
• The 64-bit multiply is cheap because the ARM has an instruction UMULL, which multiplies two 32-bit values, giving a 64-bit result.
    void scale(
        unsigned int *dest,   /* destination for the scaled data */
        unsigned int *src,    /* source unscaled data */
        unsigned int d,       /* denominator to divide by */
        unsigned int N)       /* data length (the do-while loop assumes N >= 1) */
    {
        unsigned int s = 0xFFFFFFFFul / d;   /* s = (2^32 - 1) / d */

        do
        {
            unsigned int n, q, r;

            n = *(src++);
            q = (unsigned int)(((unsigned long long)n * s) >> 32);
            r = n - q * d;
            if (r >= d)
            {
                q++;          /* the estimate q can be one too small */
            }
            *(dest++) = q;
        } while (--N);
    }
UNSIGNED DIVISION BY A CONSTANT - ALGORITHM
    unsigned int udiv_by_const(unsigned int n, unsigned int d)
    {
        unsigned int s, k, q;

        /* We assume d != 0 */
        /* first find k such that (1 << k) <= d < (1 << (k+1)) */
        for (k = 0; d/2 >= (1u << k); k++);
        if (d == 1u << k)
        {
            /* we can implement the divide with a shift */
            return n >> k;
        }
        /* d is in the range (1 << k) < d < (1 << (k+1)) */
        s = (unsigned int)(((1ull << (32+k)) + (1ull << k)) / d);
        if ((unsigned long long)s*d >= (1ull << (32+k)))
        {
            /* n/d = (n*s) >> (32+k) */
            q = (unsigned int)(((unsigned long long)n*s) >> 32);
            return q >> k;
        }
        /* n/d = (n*s+s) >> (32+k) */
        q = (unsigned int)(((unsigned long long)n*s + s) >> 32);
        return q >> k;
    }
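When d is known at compile time, k and s can be worked out by hand and only the multiply and shifts remain. The sketch below specializes the algorithm for d = 10; the function name udiv10 is illustrative. For d = 10 we have 8 <= 10 < 16, so k = 3, and s = (2^35 + 2^3) / 10 = 3435973837 = 0xCCCCCCCD. Since s * 10 = 34359738370 is at least 2^35, the first form n/d = (n*s) >> (32+k) applies.

```c
#include <stdint.h>

/* Divide an unsigned 32-bit value by the constant 10 without a divide.
   k = 3 and s = 0xCCCCCCCD are precomputed as described above. */
uint32_t udiv10(uint32_t n)
{
    uint32_t q = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 32);  /* high word of n*s */

    return q >> 3;                                               /* remaining shift by k */
}
```

On ARM the 64-bit product maps naturally onto a single UMULL, so the whole division becomes one multiply and one shift.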
SIGNED DIVISION BY A CONSTANT
The following routine, sdiv_by_const, shows how to divide by a signed constant d. In practice you will precalculate k and s at compile time. Only the operations involving n for your particular value of d need be executed at run time.
    int sdiv_by_const(int n, int d)
    {
        int s, k, q;
        unsigned int D;

        /* set D to be the absolute value of d, we assume d != 0 */
        if (d > 0)
        {
            D = (unsigned int)d;        /* 1 <= D <= 0x7FFFFFFF */
        }
        else
        {
            D = 0u - (unsigned int)d;   /* 1 <= D <= 0x80000000, negated as unsigned to avoid signed overflow */
        }
        /* first find k such that (1 << k) <= D < (1 << (k+1)) */
        for (k = 0; D/2 >= (1u << k); k++);
        if (D == 1u << k)
        {
            /* we can implement the divide with a shift */
            /* note: for D == 1, k is 0 and the shift count 32-k below is undefined in ISO C */
            q = n >> 31;                        /* 0 if n>0, -1 if n<0 */
            q = n + ((unsigned)q >> (32-k));    /* insert rounding */
            q = q >> k;                         /* divide */
            if (d < 0)
            {
                q = -q;                         /* correct sign */
            }
            return q;
        }
        /* Next find s in the range 0 <= s <= 0xFFFFFFFF */
        /* Note that k here is one smaller than the k in the equation */
        s = (int)(((1ull << (31+(k+1))) + (1ull << (k+1))) / D);
        if (s >= 0)
        {
            q = (int)(((signed long long)n*s) >> 32);
        }
        else
        {
            /* (unsigned)s = (signed)s + (1 << 32) */
            q = n + (int)(((signed long long)n*s) >> 32);
        }
        q = q >> k;
        /* if n<0 then the formula requires us to add one */
        q += (unsigned)n >> 31;
        /* if d was negative we must correct the sign */
        if (d < 0)
        {
            q = -q;
        }
        return q;
    }
SUMMARY Division
• Avoid divisions as much as possible. Do not use them for circular buffer handling.
• If you can’t avoid a division, then try to take advantage of the fact that divide routines often generate the quotient n/d and modulus n%d together.
• To repeatedly divide by the same denominator d, calculate s = (2^k - 1)/d in advance. You can replace the divide of a k-bit unsigned integer by d with a 2k-bit multiply by s.
• To divide unsigned n < 2^N by an unsigned constant d, you can find a 32-bit unsigned s and shift k such that n/d is either (ns) >> (N + k) or (ns + s) >> (N + k). The choice depends only on d. There is a similar result for signed divisions.
FLOATING POINT
• The majority of ARM processor implementations do not provide hardware floating-point support, which saves on power and area when using ARM in a price-sensitive, embedded application. With the exceptions of the Floating Point Accelerator (FPA) used on the ARM7500FE and the Vector Floating Point accelerator (VFP) hardware, the C compiler must provide support for floating point in software.
• In practice, this means that the C compiler converts every floating-point operation into a subroutine call. The C library contains subroutines to simulate floating-point behavior using integer arithmetic. This code is written in highly optimized assembly. Even so, floating-point algorithms will execute far more slowly than corresponding integer algorithms.
• If you need fast execution and fractional values, you should use fixed-point or block-floating algorithms. Fractional values are most often used when processing digital signals such as audio and video.
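As an illustration of the fixed-point approach, the sketch below multiplies two fractional values held in Q15 format, where a 16-bit integer raw represents the value raw / 32768. The Q15 format, the type name q15_t, and the function q15_mul are assumptions made for this example and are not taken from the slides. The code also assumes that right-shifting a negative value is an arithmetic shift, which holds for common ARM compilers but is implementation-defined in C.

```c
#include <stdint.h>

typedef int16_t q15_t;   /* represents raw / 32768, range [-1.0, 1.0) */

/* Multiply two Q15 values using only integer operations.
   The caller must avoid -1.0 * -1.0, whose result (+1.0) does not fit in Q15. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;       /* Q30 intermediate result */

    return (q15_t)((product + (1 << 14)) >> 15);     /* round and convert back to Q15 */
}
```

For example, 0.5 is stored as 16384, and multiplying it by itself yields 8192, which represents 0.25. The whole operation is one integer multiply, one add, and one shift, with no call into a floating-point library.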
INLINE FUNCTIONS AND INLINE ASSEMBLY
SUMMARY Inline Functions and Assembly
• Use inline functions to declare new operations or primitives not supported by the C compiler.
• Use inline assembly to access ARM instructions not supported by the C compiler. Examples are coprocessor instructions or ARMv5E extensions.
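A new primitive can be declared as an inline function in plain C. The sketch below defines a saturating 32-bit add, an operation that C does not provide directly; the name sat_add32 is illustrative, and the use of the C99 inline keyword and a 64-bit intermediate are choices made for this example. On processors with the ARMv5E extensions the same operation is available as a single instruction (QADD), which is the kind of case where inline assembly could replace the portable body.

```c
#include <stdint.h>

/* Saturating add: the result is clamped to the int32_t range
   instead of wrapping around on overflow. */
static inline int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */

    if (sum > INT32_MAX)
    {
        return INT32_MAX;
    }
    if (sum < INT32_MIN)
    {
        return INT32_MIN;
    }
    return (int32_t)sum;
}
```

Keeping the primitive behind a small inline function means callers are unaffected if the body is later swapped for an architecture-specific version.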
PORTABILITY ISSUES
The following issues are commonly encountered when porting C code to the ARM.
• The char type. On the ARM, char is unsigned rather than signed as for many other processors. A common problem concerns loops that use a char loop counter i and the continuation condition i >= 0: they become infinite loops. In this situation, armcc produces a warning of unsigned comparison with zero. You should either use a compiler option to make char signed or change loop counters to type int.
• The int type. Some older architectures use a 16-bit int, which may cause problems when moving to ARM’s 32-bit int type, although this is rare nowadays. Note that expressions are promoted to an int type before evaluation. Therefore if i = -0x1000, the expression i == 0xF000 is true on a 16-bit machine but false on a 32-bit machine.
• Unaligned data pointers. Some processors support the loading of short and int typed values from unaligned addresses. A C program may manipulate pointers directly so that they become unaligned, for example, by casting a char * to an int *. ARM architectures up to ARMv5TE do not support unaligned pointers. To detect them, run the program on an ARM with an alignment checking trap. For example, you can configure the ARM720T to data abort on an unaligned access.
• Endian assumptions. C code may make assumptions about the endianness of a memory system, for example, by casting a char * to an int *. If you configure the ARM for the same endianness the code is expecting, then there is no issue. Otherwise, you must remove endian-dependent code sequences and replace them by endian-independent ones. See Section 5.9 for more details.
• Function prototyping. The armcc compiler passes arguments narrow, that is, reduced to the range of the argument type. If functions are not prototyped correctly, then the function may return the wrong answer. Other compilers that pass arguments wide may give the correct answer even if the function prototype is incorrect. Always use ANSI prototypes.
• Use of bit-fields. The layout of bits within a bit-field is implementation and endian dependent. If C code assumes that bits are laid out in a certain order, then the code is not portable.
• Use of enumerations. Although enum is portable, different compilers allocate different numbers of bytes to an enum. The gcc compiler will always allocate four bytes to an enum type. The armcc compiler will only allocate one byte if the enum takes only eight-bit values. Therefore you can’t cross-link code and libraries between different compilers if you use enums in an API structure.
• Inline assembly. Using inline assembly in C code reduces portability between architectures. You should separate any inline assembly into small inlined functions that can easily be replaced. It is also useful to supply reference, plain C implementations of these functions that can be used on other architectures, where this is possible.
• The volatile keyword. Use the volatile keyword on the type definitions of ARM memory-mapped peripheral locations. This keyword prevents the compiler from optimizing away the memory access. It also ensures that the compiler generates a data access of the correct type. For example, if you define a memory location as a volatile short type, then the compiler will access it using 16-bit load and store instructions LDRSH and STRH.
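The char loop-counter issue from the list above can be sketched as follows. The function names are illustrative. The first function shows why a condition such as i >= 0 can never become false for an unsigned counter: decrementing zero wraps to the maximum value. The second shows the recommended form with an int counter, which terminates regardless of whether plain char is signed on the target.

```c
/* Decrementing an unsigned char that holds 0 wraps to the largest value,
   so a test such as i >= 0 is always true for an unsigned counter. */
unsigned char wraps_below_zero(void)
{
    unsigned char c = 0;

    c--;
    return c;
}

/* Counts down from start to 0 inclusive using an int counter and
   returns the number of iterations executed. */
int count_down_iterations(int start)
{
    int i;
    int iterations = 0;

    for (i = start; i >= 0; i--)   /* i eventually becomes -1, ending the loop */
    {
        iterations++;
    }
    return iterations;
}
```

With an 8-bit char, wraps_below_zero returns 255, which is what a plain char counter would do on the ARM, where char is unsigned.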