MALLABHUM INSTITUTE OF TECHNOLOGY Approved by AICTE & Affiliated to MAKAUT WB Presented by Bapan Kar University Roll Number:15800121011 Year: 3 rd Semester:6 th Subject: Data warehousing and Data Mining (PEC-IT602B) Academic Year:2023-2024
What is a data warehouse? A data warehouse is a powerful database model that significantly enhances the user’s ability to quickly analyze large, multidimensional data sets. It cleanses and organizes data to allow users to make business decisions based on facts. Hence, the data in the data warehouse must have strong . Data Warehousing Definition:- Data warehousing is an aspect to gather data from multiple sources into central repository, called Data warehouse. According to William H.lnmon , a leading architect in the construction of data warehouse systems ,”A data warehouse is a subject – oriented, integrated, time variant and non- volatile collection of data in support of management’s decision making process. “A data warehouse is simply a single complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.” --------Barry Delvin, IBM consultant Data Wrehouses : Data spread in several database– physically located at numerous sites Data warehouse-repository of multiple SBs in single scheme; resides at single site. Data warehousing processes 1. Data cleaning 2. Data integration 3.Data transformation 4. Data loading 5. periodic data refreshing
Data source in Vancouver Data source in New York Data source in Chicago Clean Transform Integrate load Data warehouse Query And Analysis tools Client client Data warehouse diagram
Data cleaning:- Data cleaning includes, filling in missing values, smoothing noisy data, identifying or removing outlines, and resolving inconsistencies. Data integration:- Data integration includes integration of multiple database, data cubes, or files. Data transformation:- Convert data from legacy or host format to warehouse format. Load:- sort; summarize , consolidate; compute views; check integrity. Build indices and partitions. Refresh :- Propagates the update from data sources to the warehouse. Data in a data warehouse are organized around major subjects. Data provide information on historical perspective – summarized on periodic dimension. Eg. Sales of an item for a region in a period Data warehouse model- multidimensional database structure/ data cube Dimensions –Attribute/ set of attributes Facts- Aggregated measures (count/sales amount)
History of data warehousing The concept of data warehousing dates back to late 1980s when IBM researcher Barry davlin and paul murphy developed the ‘the business data warehouse’. In essence , the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support enviroments . Facts about data warehousing:- Issues involved in warehousing include techniques for dealing with errors and techniques for efficient storage and indexing of large volumes of data. This system is used for reporting and data analysis. It usually contains historical data derived from transaction data. Data warehousing is not meant for current “live” data. Components of a data warehouse Sources – Data source interaction Data warehouse (data storage) Reporting (data presentation) Metadata
Data Warehouse Advantages Complete control over the four main areas of data management systems:- Clean data Query processing : multiple options Indexes: multiple types Security: data and access Data Warehousing Disadvantages Adding new data sources takes time and associated high cost. Data owners lose control over their data, raising ownership , security and privacy issues. Long initial implementation time and associated high cost. Difficult to accommodate changes in data types and ranges, data source scheme, indexes and queries.
Characteristics of data warehousing:- Subject- oriented:- A data warehouse can be used to analyze a particular subject area. For example:-” sales” can be a particular subject. Integrated:- A data warehouse integrates data from multiple data sources. For example:- Source A and B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product. Time Variant:- Historical data is kept in a data warehouse. For example:- one can retrieve data from 3 months, 6 months,12months, or even older data from a data warehouse. Non volatile:- Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered. It must be optimized for access to very large amount of data. It is based on client server architecture. It is capable of handling dynamic matrices. It maintains transparency. It is consistent and flexible.
DATA WAREHOUSE USAGE:- Three kinds of data warehouse applications 1.Information processing:- Supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs. 2.Analytical processing:- Multidimensional analysis of data warehouse data Supports basis OLAP operations ,lice-dice, drilling, pivoting 3.Data mining:- knowledge discovery from hidden patterns Supports associations , constructing analytical models, performing classification and predition , and presenting the mining results using visualization tools.
TRENDS IN DATA WAREHOUSING In the next few years, data warehousing is expected make big strides in software, especially for optimizing queries:- Indexing very large tables Enhancing SQL Improving data compression Expanding dimensional modelling Real-Time warehousing Multiple data types Adding unstructured data Spatial data Data visualization Major visualization trends Visualization types Advanced visualization techniques chart manipulation. Drill down. Advanced interaction
Bottom tier:- The bottom tier is a warehouse database server that is always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources. These tools and utilities perform data extraction ,cleaning and transformation as well as load and refresh functions to update the data warehouse. The data extracted using application program interfaces known as gateways. Example of gateways are ODBCand OLEDB by Microsoft and jdbc This tier also contains a metadata repository , which stores information about the data warehouse and its contents. Middle tier:- The middle tier is an OLAP server is typically implemented using either:- A relational OLAP model that is, an extended relation DBMS that maps opertions . Intermediate server b/w relational back-end server and client front end tools. A multidimentional OLAP model that is , a special purpose server that directly implements multidimentional data and opertions . Supports multidimentional views. Top tier :- The Top tier is a front – end client layer, which contains and reporting tools , analysis tools, and or data mining tools.
Conclusion:- ”Summarize key points and emphasize the importance of data warehousing in the modern business landscape.”