Warehouse Planning and Implementation


About This Presentation

Data Warehouse Process and Technology: Warehousing Strategy, Warehouse management and Support Processes.

Warehouse Planning and Implementation.

H/w and O.S. for Data Warehousing, C/Server Computing Model & Data Warehousing, Parallel Processors & Cluster Systems, Distributed DBMS implementations. Warehousing Software, Warehouse Schema Design. Data Extraction, Cleanup & Transformation Tools, Warehouse Metadata.


Slide Content

NCS:066 Data Warehousing and Data Mining, Unit 2. Shikha Gautam, Asst. Professor

Topics Covered: Data Warehouse Process and Technology: Warehousing Strategy, Warehouse Management and Support Processes. Warehouse Planning and Implementation. H/w and O.S. for Data Warehousing, C/Server Computing Model & Data Warehousing, Parallel Processors & Cluster Systems, Distributed DBMS Implementations. Warehousing Software, Warehouse Schema Design. Data Extraction, Cleanup & Transformation Tools, Warehouse Metadata.

Warehousing: “Storage or warehousing provides the place utility as part of logistics for any business and, along with transportation, is a critical component of customer service standards.”

Reasons for warehousing: To support the company’s customer policy. To maintain a source of supply without interruptions. To support changing market conditions and sudden changes in demand. To provide customers with the right mix of products at all times and all locations. To ensure the least logistics cost for a desired level of customer service.

Benefits of warehousing: More cost-effective decision making. Better enterprise intelligence: increasing the quality and flexibility of enterprise analysis. Enhanced customer service. Business re-engineering: knowing what information is important provides direction and priority for re-engineering efforts. Information system re-engineering.

Types of warehouses: Private warehouses: A storage facility that is mostly owned by big companies or single manufacturing units. It is also known as proprietary warehousing. Public warehouses: A facility that stores inventory for many different businesses, as opposed to a private warehouse. Contract warehouses: A contract warehouse handles the shipping, receiving, and storage of goods on a contract basis. This type of warehouse usually requires a client to commit to services for a particular period of time.

Warehousing Strategy: An integrated warehouse strategy focuses on two questions: how many warehouses should be employed, and which warehouse types should be used to meet market requirements. Many firms utilize a combination of private, public, and contract facilities.

Warehousing Strategy (cont’d): It involves the following activities: Establish sponsorship. Identify enterprise needs. Determine the measurement cycle. Validate measures. Design the data warehouse architecture. Apply appropriate technologies. Implement the data warehouse.

Warehousing Strategy (cont’d): Establish sponsorship: Establishing the right sponsorship chain will ensure successful development and implementation. The sponsorship chain should include a data warehousing manager and two key individuals. Identify enterprise needs: Interviews with key enterprise managers and analysis of other pertinent documentation are the techniques used to determine enterprise needs.

Warehousing Strategy (cont’d): Determine the measurement cycle: Describe the cycles or time periods used for the measures. Are quarters, months, or hours appropriate for capturing useful measurement data? Is historical data needed? Validate measures: After determining and identifying enterprise needs, it is necessary to “reality check” them. The feedback is used to refine the measures.

Warehousing Strategy (cont’d): Design the data warehouse architecture: This activity involves active user participation in facilitated design sessions. Apply appropriate technologies: The enterprise selects technologies and addresses key technology issues, security policies, etc. Implement the data warehouse: Load preliminary data, design the user interface, develop standard queries and reports, etc.

Process Flow in Data Warehouse: There are four major processes that build a data warehouse. Extract and load data: Data extraction takes data from the source systems; data load takes the extracted data and loads it into the data warehouse. It involves: Controlling the process: determining when to start data extraction, and ensuring that the tools, the logic modules, and the programs are executed in the correct sequence and at the correct time.

Process Flow in Data Warehouse (cont’d): When to initiate the extract: The data warehouse should present a single, consistent version of the information to the user, so the data needs to be in a consistent state. Loading the data: Data is loaded into a temporary data store, where it is cleaned up and made consistent. Cleaning and transforming the data: Clean and transform the loaded data into a suitable structure, partition the data, and build aggregations.

Process Flow in Data Warehouse (cont’d): Backup and archive the data: In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to keep regular backups. Managing queries and directing them to the appropriate data sources: This manages the queries, helps speed up query execution time, and directs the queries to their most effective data sources. It ensures that all the system sources are used in the most effective way and monitors actual query profiles.
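As a rough illustration of the control flow described above (extract, then clean and transform, then load, then back up), here is a minimal Python sketch. The step names, the sample rows, and the in-memory "warehouse" list are assumptions made purely for illustration, not part of the original slides or any particular tool.

    # Minimal sketch of controlling the warehouse load process: each step runs
    # only after the previous one has completed. All step bodies are placeholders.
    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("warehouse_process")

    def extract():
        # Pull rows from a source system; here just a hard-coded sample.
        return [{"order_id": 1, "amount": "10.5"}, {"order_id": 2, "amount": "7.0"}]

    def clean_and_transform(rows):
        # Make the data consistent (e.g. cast amounts to numbers).
        return [{"order_id": r["order_id"], "amount": float(r["amount"])} for r in rows]

    def load(rows, warehouse):
        warehouse.extend(rows)

    def backup(warehouse):
        # Placeholder for the regular backup that allows recovery after a failure.
        log.info("backed up %d rows", len(warehouse))

    def run_warehouse_process():
        # The "controlling" part: steps execute in the correct sequence.
        warehouse = []
        rows = extract()
        log.info("extracted %d rows", len(rows))
        rows = clean_and_transform(rows)
        load(rows, warehouse)
        backup(warehouse)
        return warehouse

    if __name__ == "__main__":
        run_warehouse_process()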


Warehouse Management System (WMS): A warehouse management system (WMS) is a software application designed to support and optimize warehouse or distribution center management. It facilitates management in daily planning, organizing, staffing, directing, and controlling the utilization of available resources to move and store materials into, within, and out of a warehouse, while supporting staff in the performance of material movement and storage in and around the warehouse.

A WMS typically has three parts: Load management: relates to the collection of information from internal or external sources. The loading process includes summarizing, manipulating, and changing the data structures into a format that lends itself to analytical processing. Warehouse management: the management tasks include ensuring the warehouse’s availability, the effective backup of its contents, and its security.

Query management: relates to the provision of access to the contents of the warehouse and may include the partitioning of information into different areas, with different privileges granted to different users. Access may be provided through custom-built applications or ad hoc query tools.

Data Warehouse Planning and Implementation: Includes loading preliminary data, implementing transformation programs, designing the user interface, developing standard queries and reports, and training warehouse users.

Data Warehouse Planning and Implementation (cont’d): ETL → design the user interface → develop standard queries → train users.

ETL in Data Warehouses: The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation, and Loading. Extraction: to retrieve all the required data from the source system with as few resources as possible.

1. Extraction of Data: Ways to perform the extract: Update notification – if the source system is able to provide a notification that a record has been changed, this is the easiest way to get the data. Incremental extract – the source systems are able to identify which records have been modified and provide an extract of just those records. With a daily incremental extract, we may not be able to handle deleted records. Full extract – the full extract requires keeping a copy of the last extract in the same format in order to be able to identify changes; it handles deletions as well.
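To make the trade-off between these extraction styles concrete, the sketch below contrasts an incremental extract (which sees only modified rows and misses deletions) with a full extract compared against a kept copy of the previous extract (which detects deletions too). The row layout and the id / last_modified fields are assumptions for illustration only.

    # Sketch: incremental vs. full extraction over an in-memory "source table".
    # Column names (id, last_modified) are illustrative assumptions.

    def incremental_extract(source_rows, last_extract_time):
        """Return only rows modified since the previous extract.
        Deleted rows are invisible here: they simply no longer appear."""
        return [r for r in source_rows if r["last_modified"] > last_extract_time]

    def full_extract_with_deletes(source_rows, previous_snapshot):
        """Full extract: compare against the kept copy of the last extract
        to identify changed records and deleted records."""
        current = {r["id"]: r for r in source_rows}
        previous = {r["id"]: r for r in previous_snapshot}
        changed = [r for rid, r in current.items() if previous.get(rid) != r]
        deleted_ids = [rid for rid in previous if rid not in current]
        return changed, deleted_ids

    if __name__ == "__main__":
        previous_snapshot = [
            {"id": 1, "name": "bolt", "last_modified": 10},
            {"id": 2, "name": "nut", "last_modified": 10},
        ]
        source_rows = [  # row 1 updated, row 2 deleted, row 3 inserted
            {"id": 1, "name": "bolt M6", "last_modified": 20},
            {"id": 3, "name": "washer", "last_modified": 20},
        ]
        print(incremental_extract(source_rows, last_extract_time=10))
        print(full_extract_with_deletes(source_rows, previous_snapshot))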

2. Clean: ensures the quality of the data in the data warehouse. 3. Transform: applies a set of rules to transform the data from the source to the target, converting any measured data to the same dimension using the same units so that they can later be joined. It also involves joining data from several sources, generating aggregates, generating surrogate keys, sorting, and deriving new calculated values.
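The sketch below illustrates two of the transformation tasks listed above: converting measures to a common unit and generating surrogate keys. The units, column names, and conversion factors are assumptions chosen for the example.

    # Sketch: transform step - convert all weights to kilograms and assign
    # warehouse-generated surrogate keys. Names and units are illustrative.
    import itertools

    TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.453592}  # assumed conversion factors

    def transform(rows, key_counter):
        out = []
        for r in rows:
            out.append({
                # Surrogate key generated by the warehouse, independent of source keys.
                "product_sk": next(key_counter),
                "source_id": r["id"],
                # Same dimension, same units, so rows from different sources can be joined.
                "weight_kg": r["weight"] * TO_KG[r["unit"]],
            })
        return out

    if __name__ == "__main__":
        key_counter = itertools.count(start=1)
        source_a = [{"id": "A-7", "weight": 500, "unit": "g"}]
        source_b = [{"id": "B-3", "weight": 2, "unit": "lb"}]
        print(transform(source_a, key_counter) + transform(source_b, key_counter))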

4. Load: ensures that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. Referential integrity needs to be maintained by the ETL tool to ensure consistency. 5. Managing the ETL process: There is a possibility that the ETL process fails; this can be caused by missing values in one of the reference tables, or simply by a connection or power outage. It is therefore necessary to design the ETL process with fail-recovery in mind.
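As a hedged sketch of points 4 and 5, the code below loads rows into a SQLite table inside a single transaction: if the process fails partway through, nothing is committed and the load can simply be re-run, and the INSERT OR REPLACE form keeps a re-run idempotent. The table and column names are assumptions; real ETL tools add far more (restart points, alerting, referential-integrity checks).

    # Sketch: load step with simple fail-recovery via one transaction.
    import sqlite3

    def load_rows(conn, rows):
        try:
            with conn:  # one transaction: commits on success, rolls back on error
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS sales_fact "
                    "(order_id INTEGER PRIMARY KEY, amount REAL)"
                )
                conn.executemany(
                    # INSERT OR REPLACE keeps the load idempotent if it is re-run.
                    "INSERT OR REPLACE INTO sales_fact (order_id, amount) VALUES (?, ?)",
                    [(r["order_id"], r["amount"]) for r in rows],
                )
        except sqlite3.Error as exc:
            # A real ETL process would trigger its restart/alerting logic here.
            print("load failed, nothing committed:", exc)
            raise

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        load_rows(conn, [{"order_id": 1, "amount": 10.5}, {"order_id": 2, "amount": 7.0}])
        print(conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0], "rows loaded")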

6. Staging: A staging area or landing zone is an intermediate storage area used for data processing during the ETL process. The primary motivations for its use are to increase the efficiency of ETL processes, ensure data integrity, and support data quality operations.
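A minimal sketch of how a staging area supports data quality operations, assuming a simple made-up rule (reject rows with missing or negative amounts); in practice the staging area would be a set of database tables or files rather than Python lists.

    # Sketch: staging area used for data-quality checks before the target load.
    def stage_and_validate(extracted_rows):
        staging, rejects = [], []
        for r in extracted_rows:
            if r.get("amount") is None or r["amount"] < 0:
                rejects.append(r)   # kept aside in staging for inspection/repair
            else:
                staging.append(r)   # clean rows continue to the load step
        return staging, rejects

    if __name__ == "__main__":
        extracted = [{"order_id": 1, "amount": 10.5},
                     {"order_id": 2, "amount": None},
                     {"order_id": 3, "amount": -4.0}]
        staging, rejects = stage_and_validate(extracted)
        print(len(staging), "rows staged,", len(rejects), "rows rejected")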

Well-Known ETL Tools: Commercial tools: Ab Initio, IBM InfoSphere DataStage, Informatica, Oracle Data Integrator, and SAP Data Integrator. Open-source ETL tools: CloverETL, Apatar, Pentaho, and Talend.

Planning a Data Warehouse: Data warehousing comes in all shapes and sizes, which has a direct relationship to the cost and time involved. The steps listed below summarize some of the points to consider: getting professional advice; planning the data; deciding who will use the data warehouse; and integration with external applications.

The key steps in developing a data warehouse can be summarized as follows: project initiation; requirements analysis; design (architecture, databases, and applications); construction (selecting and installing tools, developing data feeds, and building reports); deployment (release and training); and maintenance.

Client/Server Computing Model & DW: The term applies to the software architecture that describes the distribution of processing between applications and supporting services. It represents distributed, cooperating processing; the relationship between client and server is a relationship between hardware and software components. It covers a wide range of functions, services, and other aspects of a distributed environment.

1. Host-based processing: Host-based application processing is performed on one computer system with attached unintelligent, “dumb” terminals. A single stand-alone PC, or an IBM mainframe with attached character-based display terminals, are examples of host-based processing environments. Host-based processing is totally non-distributed.

2. Master-slave processing: Slave computers are attached to a master computer and perform application-processing-related functions only as directed by their master. Distribution of processing tends to be unidirectional, from master to slaves. Slaves are capable of some limited local application processing. E.g., a mainframe (host) computer such as the IBM 3090 used with cluster controllers and intelligent terminals.

3. First-generation client/server processing: This generation is used to model: Shared-device LAN processing environment: PCs are attached to a system device that allows these PCs to share a common resource, such as a file server (hard disk) or a print server. E.g., Microsoft’s LAN Manager, which allows a LAN to have a system dedicated to file and print services.

Shared device LAN processing environment

Client/server LAN processing environment: An extension of shared-device processing. E.g., SYBASE SQL Server: an application running on a PC sends a read request to its database server; the DB server processes it locally and sends only the requested records back to the PC application.
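The sketch below imitates this pattern with a tiny TCP server (it is not SYBASE SQL Server; the protocol, port handling, and data are made up for illustration): the client sends a read request for one order and the server filters locally, so only the requested record travels back over the network.

    # Sketch: client/server LAN pattern - the server processes the request
    # locally and returns only the requested records to the client.
    import json
    import socket
    import threading

    TABLE = {1: {"order_id": 1, "amount": 10.5},
             2: {"order_id": 2, "amount": 7.0}}

    def serve_once(server_sock):
        conn, _ = server_sock.accept()
        with conn:
            request = json.loads(conn.recv(1024).decode())
            # Server-side filtering: only the matching record crosses the network.
            result = TABLE.get(request["order_id"])
            conn.sendall(json.dumps(result).encode())

    if __name__ == "__main__":
        server = socket.socket()
        server.bind(("127.0.0.1", 0))   # let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        threading.Thread(target=serve_once, args=(server,), daemon=True).start()

        client = socket.socket()
        client.connect(("127.0.0.1", port))
        client.sendall(json.dumps({"order_id": 2}).encode())
        print("client received:", json.loads(client.recv(1024).decode()))
        client.close()
        server.close()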

Client server LAN processing environment

4. Second-generation client/server processing: The two-tiered architecture evolves into a multi-tiered architecture. The computing model deals with servers dedicated to application, data, transaction management, and system management. It supports relational, multidimensional, and multimedia data structures.

Distributed Database: A distributed database system consists of loosely coupled sites that share no physical component. The database systems that run at each site are independent of each other. Transactions may access data at one or more sites.

Types of Distributed Databases: In a homogeneous distributed database: all sites have identical software; sites are aware of each other and agree to cooperate in processing user requests; each site surrenders part of its autonomy in terms of the right to change schemas or software; the system appears to the user as a single system. In a heterogeneous distributed database: different sites may use different schemas and software; differences in schema are a major problem for query processing; differences in software are a major problem for transaction processing; sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing.

Distributed DBMS Environment: DDBMS architectures are generally developed depending on three parameters. Distribution: states the physical distribution of data across the different sites. Autonomy: indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently. Heterogeneity: refers to the uniformity or dissimilarity of the data models, system components, and databases.

Other features: data replication and fragmentation. The three dimensions of distribution transparency are location transparency, fragmentation transparency, and replication transparency.
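A toy sketch of horizontal fragmentation combined with location transparency, assuming two hypothetical sites and a key-range fragmentation rule: the application asks for a customer by id and never needs to know which site stores the fragment.

    # Sketch: horizontal fragmentation with location transparency.
    # The two "sites" are plain dictionaries; the rule is a key range.
    SITE_A = {1: "Alice", 2: "Bob"}        # fragment: customer_id <= 100
    SITE_B = {101: "Carol", 102: "Dave"}   # fragment: customer_id > 100

    def site_for(customer_id):
        # Fragmentation rule, hidden from the application (location transparency).
        return SITE_A if customer_id <= 100 else SITE_B

    def get_customer(customer_id):
        # The caller never names a site; the DDBMS layer routes the request.
        return site_for(customer_id).get(customer_id)

    if __name__ == "__main__":
        print(get_customer(2))    # served from SITE_A
        print(get_customer(101))  # served from SITE_B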

[Diagram: Sites 1–4 connected through a communication network.]

Data Warehouse: Server Hardware: Data warehouse operations mainly consist of huge data loads and index builds, generation of materialized views, and queries over large volumes of data. The I/O system of a data warehouse should be built to meet these heavy requirements. Architecture options: Symmetric multiprocessing (SMP), where two or more identical processors are connected to a single, shared main memory; and massively parallel processing (MPP), where a large number of processors perform a set of coordinated computations in parallel. Other sizing factors: number of CPUs, memory of the data warehouse, and number of disks.

Data Warehouse: Server OS: The server OS determines how quickly the server can fulfill client requests, how many clients it can support concurrently and reliably, and how efficiently system resources such as memory, disk I/O, and communication components are utilized.

OS requirements: multiuser support; preemptive multitasking; multithreaded design; memory protection (concurrent tasks should not violate each other's memory); scalability; security; reliability; availability.

Microkernel Technology: A microkernel is relatively small and highly secure, offering a simplified architecture, extensibility, portability, real-time support, robust system security, and multiprocessor support. This architecture results in a highly modular OS that can support multiple OS “personalities” by configuring outside services as needed. E.g., the Mach 3.0 microkernel was used by IBM to allow DOS, OS/2, and AIX to coexist on a single machine.

Parallel Processors (Distributed Memory Architecture): Shared-Nothing Architecture; Shared-Disk Architecture.

1. Shared-Nothing Architecture: [Diagram: each processing unit (PU) has its own local memory; the PUs communicate only over an interconnection network.]

2. Shared-Disk Architecture: [Diagram: each processing unit (PU) has its own local memory and is attached to the interconnection network; all PUs access a global shared disk subsystem.]
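To make the shared-nothing idea concrete, the sketch below hash-partitions rows across a few worker processes; each worker scans only its own partition (its "local" data) and returns a partial result that the coordinator merges. The data, partition count, and aggregate are assumptions for illustration.

    # Sketch: shared-nothing style parallelism - data is hash-partitioned across
    # workers, each worker touches only its own partition, partial sums are merged.
    from multiprocessing import Pool

    def partition(rows, n_parts):
        parts = [[] for _ in range(n_parts)]
        for r in rows:
            parts[hash(r["order_id"]) % n_parts].append(r)   # hash partitioning
        return parts

    def local_scan(part):
        # Each processing unit works only on its local data; nothing is shared.
        return sum(r["amount"] for r in part)

    if __name__ == "__main__":
        rows = [{"order_id": i, "amount": float(i)} for i in range(1, 101)]
        parts = partition(rows, n_parts=4)
        with Pool(processes=4) as pool:
            partial_sums = pool.map(local_scan, parts)
        print("total =", sum(partial_sums))   # the coordinator merges partial results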

Cluster Technology: A cluster is a set of loosely coupled SMP machines connected by a high-speed interconnection network. A cluster behaves just like a single large machine.