Hadoop File System (HDFS)

PrashantGupta82 10,524 views 54 slides Sep 17, 2017

About This Presentation

HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and bl...


Slide Content

Hadoop HDFS

Slide 2: Topics of the Day
- What is Big Data?
- Limitations of the existing solutions
- Solving the problem with Hadoop
- Introduction to Hadoop
- Hadoop Eco-System
- Hadoop Core Components
- HDFS Architecture
- Anatomy of a File Write and Read

Slide 3: What is Big Data?
Lots of Data (Terabytes or Petabytes). Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word cloud: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, mobile, processing, Big Data]

Slide 4: What is Big Data?
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information. NYSE generates about one terabyte of new trade data per day to perform stock trading analytics to determine trends for optimal trades.

Slide 5: Un-structured Data is Exploding

Slide 6: IBM's Definition - Big Data Characteristics (http://www-1.ibm.com/software/data/bigdata/)
Three characteristics: Volume, Velocity, Variety. Sources include web logs, videos, images, audios, and sensor data.

Slide 7: Common Big Data Customer Scenarios
- Web and e-tailing: recommendation engines, ad targeting, search quality, abuse and click fraud detection.
- Telecommunications: customer churn prevention, network performance optimization, Calling Data Record (CDR) analysis, analyzing the network to predict failure.

Slide 8: Common Big Data Customer Scenarios (Contd.)
- Government: fraud detection and cyber security, welfare schemes, justice.
- Healthcare & Life Sciences: health information exchange, gene sequencing, serialization, healthcare service quality improvements, drug safety.

Slide 9: Common Big Data Customer Scenarios (Contd.)
- Banks and financial services: modeling true risk, threat analysis, fraud detection, trade surveillance, credit scoring and analysis.
- Retail: point-of-sale transaction analysis, customer churn analysis, sentiment analysis.

Slide 10: Hidden Treasure - Case Study: Sears Holding Corporation
*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process customer activity and sales data.
Insight into data can provide business advantage. Some key early indicators can mean fortunes to business. More precise analysis is possible with more data.

Slide 11: Limitations of Existing Data Analytics Architecture (http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/110738?)
Instrumentation and collection (mostly append) feed a storage-only grid holding the original raw data; an ETL compute grid moves data into an RDBMS of aggregated data that serves BI reports and interactive apps, while 90% of the ~2PB ends up in archived storage. The problems:
1. Can't explore the original high-fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death
A meagre 10% of the ~2PB of data is available for BI.

Slide 12: Solution - A Combined Storage Compute Layer
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than a meagre 10% as was the case with the existing non-Hadoop solutions.
Instrumentation and collection (mostly append) now feed Hadoop directly: a combined storage + compute grid handles both storage and processing, with the RDBMS (aggregated data) still serving BI reports and interactive apps, and no data archiving. The gains:
1. Data exploration and advanced analytics
2. Scalable throughput for ETL and aggregation
3. Keep data alive forever
The entire ~2PB of data is available for processing.

Slide 13: Hadoop Differentiating Factors
Accessible, Robust, Simple, Scalable.

Slide 14: Hadoop - It's about Scale and Structure
RDBMS / EDW / MPP / NoSQL vs. Hadoop:
- Data types: structured vs. multi- and unstructured
- Processing: limited or no data processing vs. processing coupled with data
- Governance: standards and structured vs. loosely structured
- Schema: required on write vs. required on read
- Cost: software license vs. support only
- Resources: known entity vs. growing, complexities, wide
- Best fit use: interactive OLAP analytics and operational data store vs. massive storage/processing

Slides 15-17: Why DFS?
Read 1 TB of data:
- 1 machine, 4 I/O channels, each channel 100 MB/s: 45 minutes
- 10 machines, 4 I/O channels each, each channel 100 MB/s: 4.5 minutes
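The 45-minute and 4.5-minute figures on these slides follow from simple bandwidth arithmetic; a minimal bash sketch of the calculation (assuming 1 TB = 1,000,000 MB):

```shell
# Back-of-the-envelope check of the slide's numbers.
TB_MB=1000000   # 1 TB expressed in MB (decimal, as on the slide)
CHANNELS=4
MB_PER_SEC=100

# One machine: aggregate bandwidth = 4 x 100 MB/s = 400 MB/s
one_machine_sec=$(( TB_MB / (CHANNELS * MB_PER_SEC) ))
echo "1 machine:  about $(( one_machine_sec / 60 )) minutes"

# Ten machines reading in parallel: 10x the aggregate bandwidth
ten_machines_sec=$(( one_machine_sec / 10 ))
echo "10 machines: about $(( ten_machines_sec / 60 )) minutes"
```

This gives roughly 41 and 4 minutes, which the deck rounds to 45 and 4.5; the point is that parallel reads across machines cut the time linearly.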

Slide 18: What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is open-source data management with scale-out storage and distributed processing.

Slide 19: Hadoop Eco-System
HDFS (Hadoop Distributed File System) sits at the base, with the MapReduce framework and HBase above it. Sqoop imports and exports structured data, and Flume ingests unstructured or semi-structured data. On top: Pig Latin (data analysis), Mahout (machine learning), Hive (DW system), and Apache Oozie (workflow).

Slide 20: Hadoop Core Components
HDFS cluster: a NameNode (on the admin node) and DataNodes. MapReduce engine: a JobTracker and TaskTrackers, with one TaskTracker alongside each DataNode.

Slide 21: Hadoop Core Components
Hadoop is a system for large-scale data processing. It has two main components:
- HDFS - Hadoop Distributed File System (storage): distributed across "nodes"; natively redundant; the NameNode tracks locations.
- MapReduce (processing): splits a task across processors "near" the data and assembles results; self-healing, high-bandwidth clustered storage; the JobTracker manages the TaskTrackers.

Slide 22: HDFS Definition
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. It has many similarities with existing distributed file systems. HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
HDFS consists of the following components (daemons):
- Name Node
- Data Node
- Secondary Name Node

Slide 23: HDFS Components
NameNode: a master server that manages the file system namespace and regulates access to files by clients. It maintains and manages the blocks which are present on the DataNodes.
- Metadata in memory: the entire metadata is kept in main memory
- Types of metadata: list of files; list of blocks for each file; list of DataNodes for each block; file attributes, e.g. creation time, replication factor
- A transaction log records file creations, file deletions, etc.

Slide 24: HDFS Components
DataNode: slaves which are deployed on each machine and provide the actual storage. DataNodes, one per node in the cluster, manage the storage attached to the nodes that they run on.
- A block server: stores data in the local file system (e.g. ext3); stores metadata of a block (e.g. CRC); serves data and metadata to clients
- Block report: periodically sends a report of all existing blocks to the NameNode
- Facilitates pipelining of data: forwards data to other specified DataNodes

Slide 25: Understanding the File System
Block placement (current strategy):
- One replica on the local node
- Second replica on a remote rack
- Third replica on the same remote rack
- Additional replicas are randomly placed
- Clients read from the nearest replica
Data correctness: checksums (CRC32) are used to validate data.
- File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksums
- File access: the client retrieves the data and checksums from the DataNode; if validation fails, the client tries other replicas
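The per-512-byte checksumming idea can be imitated locally. A toy sketch (file and directory names here are made up), using the POSIX cksum tool as a stand-in for HDFS's CRC32:

```shell
# Imitate HDFS-style checksumming: one checksum per 512-byte chunk of a block.
tmpdir=$(mktemp -d)
head -c 1536 /dev/urandom > "$tmpdir/block"   # a toy 1536-byte "block"
split -b 512 "$tmpdir/block" "$tmpdir/chunk_" # split into three 512-byte chunks

# Compute and store one checksum per chunk, alongside the data
for c in "$tmpdir"/chunk_*; do
  cksum "$c"
done > "$tmpdir/checksums"

nchunks=$(wc -l < "$tmpdir/checksums")
echo "chunks checksummed: $nchunks"
```

On read, a client would recompute each chunk's checksum and compare against the stored value; a mismatch means the replica is corrupt and another replica should be tried.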

Slide 26: Understanding the File System
Data pipelining:
- The client retrieves a list of DataNodes on which to place replicas of a block
- The client writes the block to the first DataNode
- The first DataNode forwards the data to the next DataNode in the pipeline
- When all replicas are written, the client moves on to write the next block in the file
Rebalancer:
- Goal: the percentage of disk occupied on DataNodes should be similar
- Usually run when new DataNodes are added
- The cluster stays online while the rebalancer is active
- The rebalancer is throttled to avoid network congestion
- It is a command-line tool
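The rebalancer's goal can be illustrated with a little arithmetic. A bash sketch with made-up per-DataNode utilization numbers (the 10% threshold matches the balancer's default, but both the numbers and the flagging logic here are illustrative only):

```shell
# Hypothetical per-DataNode disk usage (%) before balancing
usages=(82 45 60 13)

# Cluster average utilization
total=0
for u in "${usages[@]}"; do total=$(( total + u )); done
avg=$(( total / ${#usages[@]} ))
echo "cluster average utilization: ${avg}%"

# The balancer moves blocks until every node sits within a
# threshold (10% by default) of the cluster average.
threshold=10
for u in "${usages[@]}"; do
  diff=$(( u > avg ? u - avg : avg - u ))
  [ "$diff" -gt "$threshold" ] && echo "node at ${u}% needs rebalancing"
done
```

Here the average is 50%, so the nodes at 82% and 13% are outside the threshold: blocks would flow from the over-utilized node to the under-utilized one.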

Slide 27: Secondary NameNode
- Not a hot standby for the NameNode
- Connects to the NameNode every hour*
- Housekeeping: backup of NameNode metadata
- The saved metadata can be used to rebuild a failed NameNode

Slide 28: HDFS Architecture
The NameNode holds the metadata (name, replicas, ...) and serves metadata ops from clients. DataNodes in Rack 1 and Rack 2 hold the blocks, serve client reads and writes, and carry out block ops and replication between racks.

Slide 29: Anatomy of a File Write
1. The HDFS client calls create on the DistributedFileSystem
2. Create (DistributedFileSystem to NameNode)
3. Write (client to the write stream)
4. Write packet (down the pipeline of DataNodes)
5. Ack packet (back up the pipeline)
6. Close
7. Complete (DistributedFileSystem to NameNode)
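The pipeline behavior in steps 4-5 can be sketched with local files standing in for DataNodes. This is only a toy imitation (the node names and the write_packet helper are invented for illustration): the client hands each packet to the first node, and the data is forwarded down the chain until all replicas hold it.

```shell
# Toy imitation of the write pipeline: three local files act as "DataNodes".
dir=$(mktemp -d)
nodes=("$dir/dn1" "$dir/dn2" "$dir/dn3")

write_packet() {                        # forward one packet down the pipeline
  local packet=$1
  for n in "${nodes[@]}"; do
    printf '%s\n' "$packet" >> "$n"     # each node stores, then "forwards" on
  done
}

write_packet "packet-1"
write_packet "packet-2"

# After step 7 (complete), all replicas hold identical data
cmp -s "${nodes[0]}" "${nodes[1]}" && cmp -s "${nodes[1]}" "${nodes[2]}" \
  && echo "replicas identical"
```

In real HDFS the forwarding happens node-to-node over the network rather than from the client, which is exactly why moving data to compute doesn't become a client-side bottleneck.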

Slide 30: Anatomy of a File Read
1. Open (HDFS client to DistributedFileSystem)
2. Get block locations (DistributedFileSystem to NameNode)
3. Read (client to FSDataInputStream)
4. Read (FSDataInputStream to the first DataNode)
5. Read (FSDataInputStream to the next DataNode)
6. Close
All of this happens inside the client JVM on the client node.

Slide 31: Replication and Rack Awareness

Hadoop Cluster Architecture (Contd.)
A client talks to the HDFS NameNode and to the MapReduce JobTracker. Each slave machine runs a DataNode and a TaskTracker: the NameNode manages the DataNodes, and the JobTracker schedules work onto the TaskTrackers.

Hadoop Cluster: A Typical Use Case
- NameNode: RAM 64 GB, hard disk 1 TB, Xeon processor with 8 cores, Ethernet 3 x 10 GB/s, OS 64-bit CentOS, redundant power supply
- Secondary NameNode: RAM 32 GB, hard disk 1 TB, Xeon processor with 4 cores, Ethernet 3 x 10 GB/s, OS 64-bit CentOS, redundant power supply
- DataNodes: RAM 16 GB, hard disk 6 x 2 TB, Xeon processor with 2 cores, Ethernet 3 x 10 GB/s, OS 64-bit CentOS

Hadoop Cluster: Facebook
http://wiki.apache.org/hadoop/PoweredBy

Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
- Standalone (or local) mode: no daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
- Pseudo-distributed mode: Hadoop daemons run on the local machine.
- Fully-distributed mode: Hadoop daemons run on a cluster of machines.

Hadoop 1.x: Core Configuration Files
- core-site.xml (Core)
- hdfs-site.xml (HDFS)
- mapred-site.xml (MapReduce)

Hadoop Configuration Files
- hadoop-env.sh: environment variables that are used in the scripts to run Hadoop.
- core-site.xml: configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
- hdfs-site.xml: configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes.
- mapred-site.xml: configuration settings for the MapReduce daemons: the job-tracker and the task-trackers.
- masters: a list of machines (one per line) that each run a secondary namenode.
- slaves: a list of machines (one per line) that each run a datanode and a task-tracker.

core-site.xml and hdfs-site.xml

core-site.xml:

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>

hdfs-site.xml:

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Defining HDFS Details in hdfs-site.xml
- dfs.data.dir (default ${hadoop.tmp.dir}/dfs/data), e.g. <value>/disk1/hdfs/data,/disk2/hdfs/data</value>: a list of directories where the datanode stores blocks. Each block is stored in only one of these directories.
- fs.checkpoint.dir (default ${hadoop.tmp.dir}/dfs/namesecondary), e.g. <value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>: a list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.

mapred-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

Defining mapred-site.xml
- mapred.job.tracker (e.g. <value>localhost:8021</value>): the hostname and the port that the jobtracker RPC server runs on. If set to the default value of local, then the jobtracker runs in-process on demand when you run a MapReduce job.
- mapred.local.dir (default ${hadoop.tmp.dir}/mapred/local): a list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.
- mapred.system.dir (default ${hadoop.tmp.dir}/mapred/system): the directory, relative to fs.default.name, where shared files are stored during a job run.
- mapred.tasktracker.map.tasks.maximum (default 2): the number of map tasks that may be run on a tasktracker at any one time.
- mapred.tasktracker.reduce.tasks.maximum (default 2): the number of reduce tasks that may be run on a tasktracker at any one time.

All Properties
http://hadoop.apache.org/docs/r1.1.2/core-default.html
http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html

Slaves and Masters
Two files are used by the startup and shutdown commands:
- masters: contains a list of hosts, one per line, that are to host secondary NameNode servers.
- slaves: contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers.
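For illustration, here is how a small slaves file might be built; the worker hostnames are hypothetical, and the real file lives in the Hadoop conf/ directory:

```shell
# Sketch: create a slaves file with one (made-up) worker hostname per line.
confdir=$(mktemp -d)
cat > "$confdir/slaves" <<'EOF'
worker1.example.com
worker2.example.com
worker3.example.com
EOF

nslaves=$(wc -l < "$confdir/slaves")
echo "$nslaves slave hosts listed"
```

The start-up scripts iterate over this list and launch a DataNode and TaskTracker on each listed host.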

hadoop-env.sh: Per-Process Runtime Environment
hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation. This file also offers a way to provide custom parameters for each of the servers. Set the JAVA_HOME parameter to the JVM. Examples of environment variables that you can specify:
export HADOOP_DATANODE_HEAPSIZE="128"
export HADOOP_TASKTRACKER_HEAPSIZE="512"

Web UI URLs
- NameNode status: http://localhost:50070/dfshealth.jsp
- JobTracker status: http://localhost:50030/jobtracker.jsp
- TaskTracker status: http://localhost:50060/tasktracker.jsp
- Data Block Scanner report: http://localhost:50075/blockScannerReport

Practice HDFS Commands

# Open a terminal window to the current working directory.
# /home/notroot
-----------------------------------------------------------------
# 1. Print the Hadoop version
hadoop version
-----------------------------------------------------------------
# 2. List the contents of the root directory in HDFS
hadoop fs -ls /
-----------------------------------------------------------------
# 3. Report the amount of space used and available on the currently mounted filesystem
hadoop fs -df hdfs:/
-----------------------------------------------------------------
# 4. Count the number of directories, files and bytes under the paths that match the specified file pattern
hadoop fs -count hdfs:/

# 5. Run a DFS filesystem checking utility
hadoop fsck /
-----------------------------------------------------------------
# 6. Run a cluster balancing utility
hadoop balancer
-----------------------------------------------------------------
# 7. Create a new directory named "hadoop" below the /user/training directory in HDFS.
hadoop fs -mkdir /user/training/hadoop
-----------------------------------------------------------------
# 8. Add a sample text file from the local directory named "data" to the new directory you created in HDFS
hadoop fs -put data/sample.txt /user/training/hadoop

# 9. List the contents of this new directory in HDFS.
hadoop fs -ls /user/training/hadoop
-----------------------------------------------------------------
# 10. Add the entire local directory called "retail" to the /user/training directory in HDFS.
hadoop fs -put data/retail /user/training/hadoop
-----------------------------------------------------------------
# 11. Since /user/training is your home directory in HDFS, any command that does not have an absolute path is interpreted as relative to that directory.
# The next command will therefore list your home directory, and should show the items you've just added there.
hadoop fs -ls
-----------------------------------------------------------------
# 12. See how much space this directory occupies in HDFS.
hadoop fs -du -s -h hadoop/retail

# 13. Delete a file 'customers' from the "retail" directory.
hadoop fs -rm hadoop/retail/customers
-----------------------------------------------------------------
# 14. Ensure this file is no longer in HDFS.
hadoop fs -ls hadoop/retail/customers
-----------------------------------------------------------------
# 15. Delete all files from the "retail" directory using a wildcard.
hadoop fs -rm hadoop/retail/*
-----------------------------------------------------------------
# 16. To empty the trash
hadoop fs -expunge
-----------------------------------------------------------------
# 17. Finally, remove the entire retail directory and all of its contents in HDFS.
hadoop fs -rm -r hadoop/retail

# 18. Add the purchases.txt file from the local directory named "/home/training/" to the hadoop directory you created in HDFS
hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/
-----------------------------------------------------------------
# 19. View the contents of your text file purchases.txt, which is present in your hadoop directory.
hadoop fs -cat hadoop/purchases.txt
-----------------------------------------------------------------
# 20. Move a directory from one location to another
hadoop fs -mv hadoop apache_hadoop
-----------------------------------------------------------------
# 21. Copy the purchases.txt file from the "hadoop" directory in HDFS to the directory "data" in your local directory
hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data

# 26. Default names of owner and group are training, training
# Use '-chown' to change owner name and group name simultaneously
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt
-----------------------------------------------------------------
# 27. Default name of group is training
# Use '-chgrp' command to change group name
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chgrp training hadoop/purchases.txt
-----------------------------------------------------------------
# 28. Default replication factor of a file is 3.
# Use '-setrep' command to change the replication factor of a file
hadoop fs -setrep -w 2 apache_hadoop/sample.txt

# 29. Copy a directory from one node in the cluster to another
# Use the 'distcp' command to copy,
# Use the '-overwrite' option to overwrite existing files
# Use the '-update' option to synchronize both directories
hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
-----------------------------------------------------------------
# 30. Command to make the name node leave safe mode
sudo -u hdfs hdfs dfsadmin -safemode leave
-----------------------------------------------------------------
# 31. List all the hadoop file system shell commands
hadoop fs
-----------------------------------------------------------------
# 33. Last but not least, always ask for help!
hadoop fs -help

Thank You