This slide deck covers basic SQL and DBMS concepts for BCom students.
Size: 1.33 MB
Language: en
Added: Oct 18, 2024
Slides: 72 pages
Slide Content
SQL
Dr. M. Kriushanth, Assistant Professor
Department of Data Science
St. Joseph's College (Autonomous), Tiruchirappalli
What is a database management system (DBMS)? A database management system (DBMS) is a software system for creating and managing databases. A DBMS enables end users to create, protect, read, update and delete data in a database. It also manages security, data integrity and concurrency for databases.
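The create, read, update and delete operations named above can be sketched with Python's built-in sqlite3 module, which embeds a small DBMS; the student table and its values are invented for illustration.

```python
import sqlite3

# Open an in-memory database (an embedded DBMS shipped with Python).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create: define a table and insert a row.
cur.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO student VALUES (1, 'RAM')")

# Read: query the data back.
row = cur.execute("SELECT name FROM student WHERE roll_no = 1").fetchone()
print(row[0])  # RAM

# Update: change an existing value.
cur.execute("UPDATE student SET name = 'RAMESH' WHERE roll_no = 1")

# Delete: remove the row again.
cur.execute("DELETE FROM student WHERE roll_no = 1")
print(cur.execute("SELECT COUNT(*) FROM student").fetchone()[0])  # 0

conn.close()
```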
What does a DBMS do? A DBMS manages the data. The database engine enables data to be accessed, locked and modified and the database schema defines the database's logical structure. These three foundational data elements help provide concurrency, security, data integrity and uniform data administration procedures.
Common Functions of DBMS Administration tasks Storage Concurrency control Centralized view Data manipulation Data independence Backup and recovery
Objectives of DBMS Data Availability Data Organization and Structuring Data Integrity and Consistency Data Security and Authorization Data Independence
Advantages and Disadvantages of a DBMS
Advantages:
1) A DBMS (database management system) stores and retrieves data in an efficient manner.
2) A DBMS is a reliable manager for working with numerous applications.
3) A DBMS guarantees data security and privacy.
4) A DBMS enforces a high degree of integrity, ensuring accuracy and protection from unauthorized access to data.
Disadvantages:
1) The cost of DBMS software and hardware can be very high, which affects the organizational budget.
2) A DBMS is inadequate for handling complex mathematical calculations.
3) Most DBMS systems are complicated, so employees need to be trained to operate them.
4) Because many applications depend on the same central database, a failure or corruption in the DBMS can lead to data loss across all of them.
Features of DBMS 1. Minimum Redundancy and Duplication 2. Reduced amount of space and money spent on storage (Physical setup of server and storage) 3. Data Organization 4. Customization of the Database 5. Data Retrieval 6. Usage Of Query Languages 7. Multi User Access (IRCTC reservation) 8. Data Integrity is Maintained 9. Management of Metadata 10. Maintenance of a Large Database 11. Data Durability 12. Provides a High Level of Data Security 13. Enhanced File Uniformity
Data Models in DBMS A Data Model in a Database Management System (DBMS) is a set of concepts and tools used to summarize and describe the structure of a database. Data models give us a clear picture of the data, which helps us in creating the actual database, taking us from the design of the data to its proper implementation. Types of Data Models: Conceptual Data Model, Representational Data Model, Physical Data Model
Conceptual Data Model The conceptual data model describes the database at a very high level and is useful for understanding the needs or requirements of the database. It is this model that is used in the requirement-gathering process, i.e. before the database designers start building a particular database. One such popular model is the entity/relationship model (ER model). The E/R model specializes in entities, relationships, and the attributes used by database designers.
Representational Data Model This type of data model represents only the logical part of the database, not its physical structure. The representational data model allows us to focus primarily on the design of the database. A popular representational model is the Relational model, which consists of Relational Algebra and Relational Calculus. In the Relational model, we use tables to represent our data and the relationships between them.
Physical Data Model The physical data model is used to practically implement the relational data model. Ultimately, all data in a database is stored physically on secondary storage devices such as disks and tapes, in the form of files, records, and other data structures. The physical model holds all the information about the format in which the files are stored, the structure of the database, the presence of external data structures, and their relationships to each other; tables are laid out in storage so that they can be accessed efficiently. To arrive at a good physical model, we have to refine the relational model carefully. Structured Query Language (SQL) is used to practically implement Relational Algebra.
File Organization File Organization refers to the logical relationships among the various records that constitute a file, particularly with respect to the means of identification of, and access to, any specific record. In simple terms, storing files in a certain order is called file organization. File Structure refers to the format of the label and data blocks and of any logical control record. Types of File Organization: Sequential File Organization, Heap File Organization, Hash File Organization, B+ Tree File Organization, Clustered File Organization, ISAM (Indexed Sequential Access Method)
Sequential File Organization The easiest method for file Organization is the Sequential method. In this method, the file is stored one after another in a sequential manner. There are two ways to implement this method: Pile File Method Sorted File Method
Heap File Organization Heap file organization works with data blocks. In this method, records are inserted at the end of the file, into the data blocks. No sorting or ordering is required. If a data block is full, the new record is stored in some other block; this need not be the very next data block, but can be any block in memory. It is the responsibility of the DBMS to store and manage the new records.
Hash File Organization Data bucket – Data buckets are the memory locations where the records are stored. These buckets are also considered Units of Storage. Hash Function – The hash function is a mapping function that maps all the sets of search keys to the actual record address. Hash Index- The prefix of an entire hash value is taken as a hash index. Every hash index has a depth value to signify how many bits are used for computing a hash function.
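The data-bucket and hash-function ideas above can be sketched in Python; the bucket count, keys, and record values below are invented for illustration.

```python
# Sketch of hash file organization: a hash function maps each search
# key to one of a fixed number of data buckets (units of storage).
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]  # each bucket holds records

def bucket_for(key):
    # The "hash function": maps a search key to a bucket address.
    return hash(key) % NUM_BUCKETS

def insert(key, record):
    buckets[bucket_for(key)].append((key, record))

def lookup(key):
    # Only the single bucket the key hashes to is scanned,
    # instead of the whole file.
    for k, rec in buckets[bucket_for(key)]:
        if k == key:
            return rec
    return None

insert(101, "RAM")
insert(102, "RAMESH")
print(lookup(101))  # RAM
print(lookup(999))  # None
```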
B+ Tree B+ Tree, as the name suggests, uses a tree-like structure to store records in a File. It uses the concept of Key indexing where the primary key is used to sort the records. For each primary key, an index value is generated and mapped with the record. An index of a record is the address of the record in the file.
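A full B+ tree implementation is beyond the scope of a slide, but its core idea (a sorted index mapping each primary key to a record's address in the file) can be sketched with Python's bisect module; the records below are invented for illustration.

```python
import bisect

# The "file" holds records in insertion order; the index keeps
# primary keys sorted, each paired with the record's address,
# much like the leaf level of a B+ tree.
file_records = []
index_keys = []    # sorted primary keys
index_addrs = []   # file address (position) for each key

def insert(key, record):
    addr = len(file_records)       # record's address in the file
    file_records.append(record)
    pos = bisect.bisect_left(index_keys, key)
    index_keys.insert(pos, key)    # keep the index sorted by key
    index_addrs.insert(pos, addr)

def find(key):
    pos = bisect.bisect_left(index_keys, key)
    if pos < len(index_keys) and index_keys[pos] == key:
        return file_records[index_addrs[pos]]
    return None

insert(3, "SUJIT")
insert(1, "RAM")
insert(2, "RAMESH")
print(find(2))  # RAMESH
```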
Cluster File Organization In cluster file organization, two or more related tables/records are stored within the same file, known as a cluster. These files hold two or more tables in the same data block, and the key attributes used to map the tables together are stored only once. This lowers the cost of searching for and retrieving records that would otherwise sit in different files, since they are now combined and kept in a single cluster.
ISAM (Indexed Sequential Access Method): A combination of sequential and indexed methods. Data is stored sequentially, but an index is maintained for faster access. Think of it like having a bookmark in a book that guides you to specific pages.
Recent Trends in Database Cloud Databases : Adoption of Cloud-Native Databases : Many organizations are moving to cloud-native databases offered by providers like AWS, Google Cloud, and Azure. These databases are designed to leverage cloud infrastructure for scalability, availability, and performance. Database as a Service (DBaaS) : DBaaS offerings are growing in popularity, allowing organizations to offload database management and maintenance to cloud providers.
Multi-Model Databases : Support for Multiple Data Models : Multi-model databases support different data models (e.g., relational, document, key-value, graph) within a single database engine. This flexibility helps organizations manage diverse data types more efficiently.
Distributed Databases : Scalability and Availability : Distributed databases like Apache Cassandra, CockroachDB, and Google Spanner are designed to handle large-scale data across multiple nodes, providing high availability and fault tolerance. Global Distribution : These databases enable data to be distributed across multiple geographic locations, supporting low-latency access and compliance with data sovereignty regulations.
Graph Databases : Growing Importance of Relationships : Graph databases like Neo4j and Amazon Neptune are becoming more popular for applications that require modeling and querying complex relationships, such as social networks, fraud detection, and recommendation engines.
Time-Series Databases : Handling Time-Stamped Data : Time-series databases like InfluxDB and TimescaleDB are optimized for handling time-stamped data, making them ideal for applications in IoT, monitoring, and analytics.
NoSQL Databases : Flexibility and Performance : NoSQL databases like MongoDB, Couchbase, and DynamoDB continue to grow in popularity due to their schema-less design, which provides flexibility in data modeling and high performance for specific workloads.
AI and Machine Learning Integration : In-Database Analytics : Integrating AI and machine learning capabilities directly within databases enables real-time analytics and predictive modeling, reducing data movement and latency. Automated Tuning and Optimization : AI-driven tools are being used to automatically tune and optimize database performance, reducing the need for manual intervention.
Data Privacy and Security : Enhanced Security Features : Databases are incorporating advanced security features like encryption, access controls, and auditing to protect sensitive data. Compliance with Regulations : Ensuring compliance with data privacy regulations (e.g., GDPR, CCPA) is a critical focus for database vendors and users.
Edge Computing : Databases at the Edge : With the rise of edge computing, databases are being deployed closer to where data is generated to reduce latency and support real-time processing. Serverless Databases : Pay-As-You-Go Models : Serverless database offerings, such as AWS Aurora Serverless and Google Cloud Firestore, provide automatic scaling and cost efficiency by charging only for actual usage.
Blockchain Databases : Immutable and Transparent Data : Blockchain databases like BigchainDB are gaining traction for applications requiring immutable and transparent data storage, such as supply chain management and digital identity verification. Hybrid and Multi-Cloud Strategies : Flexibility and Avoiding Vendor Lock-In : Organizations are increasingly adopting hybrid and multi-cloud strategies to leverage the best features of different cloud providers and avoid vendor lock-in.
ER Diagram
Entity An entity may be an object with a physical existence (a particular person, car, house, or employee) or an object with a conceptual existence (a company, a job, or a university course). Entities are of two types: 1. Tangible entity: one that can be touched, like a car or a person. 2. Non-tangible entity: one that cannot be touched, like air or a bank account.
Entity Set: An entity is an object of an entity type, and the set of all entities is called an entity set. For example, E1 is an entity of entity type Student, and the set of all students is called an entity set. In an ER diagram, an entity type is represented as a rectangle.
Relational Data Model The relational model represents how data is stored in relational databases. A relational database consists of a collection of tables, each of which is assigned a unique name. Consider a relation STUDENT with attributes ROLL_NO, NAME, ADDRESS, PHONE, and AGE shown in the table:
ROLL_NO | NAME   | ADDRESS | PHONE      | AGE
1       | RAM    | DELHI   | 9455123451 | 18
2       | RAMESH | GURGAON | 9652431543 | 18
3       | SUJIT  | ROHTAK  | 9156253131 | 20
4       | SURESH | DELHI   |            | 18
Important Terminologies Attribute: Attributes are the properties that define an entity, e.g. ROLL_NO, NAME, ADDRESS. Relation Schema: A relation schema defines the structure of the relation and represents the name of the relation with its attributes, e.g. STUDENT (ROLL_NO, NAME, ADDRESS, PHONE, AGE) is the relation schema for STUDENT. If a schema has more than one relation, it is called a Relational Schema. Tuple: Each row in the relation is known as a tuple. The above relation contains 4 tuples, one of which is: (1, RAM, DELHI, 9455123451, 18)
NULL Values: A value that is not known or unavailable is called a NULL value. It is shown in the table as a blank space (or the keyword NULL), e.g. the PHONE of the STUDENT with ROLL_NO 4 is NULL. Relation Key: These are the keys used to identify rows uniquely, or that help in identifying tables: Primary Key, Candidate Key, Super Key, Foreign Key, Alternate Key, Composite Key
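The STUDENT relation above, including its NULL phone value and its ROLL_NO primary key, can be reproduced with Python's built-in sqlite3 module; note that NULL must be tested with IS NULL rather than the = operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# ROLL_NO is the primary key: it uniquely identifies each tuple.
cur.execute("""CREATE TABLE student (
    roll_no INTEGER PRIMARY KEY,
    name    TEXT,
    address TEXT,
    phone   TEXT,
    age     INTEGER)""")

cur.executemany("INSERT INTO student VALUES (?, ?, ?, ?, ?)", [
    (1, "RAM",    "DELHI",   "9455123451", 18),
    (2, "RAMESH", "GURGAON", "9652431543", 18),
    (3, "SUJIT",  "ROHTAK",  "9156253131", 20),
    (4, "SURESH", "DELHI",   None,         18),  # PHONE unknown -> NULL
])

# NULL values are matched with IS NULL, not with =.
rows = cur.execute("SELECT name FROM student WHERE phone IS NULL").fetchall()
print(rows)  # [('SURESH',)]
conn.close()
```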
Relational Algebra Relational Algebra is a procedural query language. Relational algebra mainly provides a theoretical foundation for relational databases and SQL. The main purpose of using Relational Algebra is to define operators that transform one or more input relations into an output relation. Given that these operators accept relations as input and produce relations as output, they can be combined and used to express potentially complex queries that transform potentially many input relations (whose data are stored in the database) into a single output relation (the query results). As it is pure mathematics, there is no use of English Keywords in Relational Algebra and operators are represented using symbols.
Fundamental Operators Selection(σ) Projection(π) Union(U) Set Difference(-) Set Intersection(∩) Rename(ρ) Cartesian Product(X)
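Because relations are sets of tuples, the fundamental operators above can be sketched directly with Python sets; the relations R and S below are invented for illustration.

```python
# Each relation is a set of tuples; attribute positions: (roll_no, name, age).
R = {(1, "RAM", 18), (2, "RAMESH", 18), (3, "SUJIT", 20)}
S = {(2, "RAMESH", 18), (4, "SURESH", 18)}

# Selection (sigma): keep only the tuples satisfying a predicate.
selection = {t for t in R if t[2] == 18}

# Projection (pi): keep only some attributes; as a set, duplicates vanish.
projection = {(t[1],) for t in R}

# Union, set difference, and intersection are the plain set operations.
union = R | S
difference = R - S
intersection = R & S

# Cartesian product (X): every pairing of a tuple from R with one from S.
product = {(r, s) for r in R for s in S}

print(len(selection), len(union), len(difference), len(product))  # 2 4 2 6
```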
Data Dictionary A data dictionary can be defined as a collection of information on all data elements or contents of a database, such as data types and text descriptions of the system. It makes it easier for users and analysts to use data, as well as to understand and share common knowledge about the inputs, outputs, components of a database, and intermediate calculations. The following types of information are stored in a data dictionary:
Name: The primary name of each composite data or control item, and the name of any external entity or data store.
Aliases: Any other word used in place of the name.
Where/How it is used: Where and how each data or control item is used, e.g. as an input to or an output from a process.
Description: A notation for representing the content.
Normalization Normalization is the process of organizing the data in a database. It is used to minimize redundancy in a relation or set of relations, and to eliminate undesirable characteristics such as insertion, update, and deletion anomalies. Normalization divides larger tables into smaller ones and links them using relationships. Normal forms are used to reduce redundancy in database tables.
Data modification anomalies can be categorized into three types: Insertion Anomaly: when one cannot insert a new tuple into a relation due to a lack of data. Deletion Anomaly: when the deletion of some data results in the unintended loss of other important data. Update Anomaly: when an update of a single data value requires multiple rows of data to be updated.
Normal Form: Description
1NF: A relation is in 1NF if every attribute contains only atomic values.
2NF: A relation is in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key.
3NF: A relation is in 3NF if it is in 2NF and no transitive dependency exists.
BCNF: A stronger definition of 3NF, known as Boyce-Codd Normal Form.
4NF: A relation is in 4NF if it is in BCNF and has no multi-valued dependency.
5NF: A relation is in 5NF if it is in 4NF and contains no join dependency, and all joins are lossless.
BCNF Application of the general definitions of 2NF and 3NF may identify additional redundancy caused by dependencies that violate one or more candidate keys. However, despite these additional constraints, dependencies can still exist that will cause redundancy to be present in 3NF relations. This weakness in 3NF resulted in the presentation of a stronger normal form called the Boyce-Codd Normal Form.
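A common way to illustrate normalization (the tables and columns below are invented for this example) is to split a table that repeats department details for every student into two relations linked by a foreign key, which removes the update anomaly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unnormalized, the department's name would be repeated on every
# student row, so renaming a department means touching many rows
# (an update anomaly). Normalized: split into DEPARTMENT and
# STUDENT, linked by the dept_id foreign key.
cur.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT)")
cur.execute("""CREATE TABLE student (
    roll_no INTEGER PRIMARY KEY,
    name    TEXT,
    dept_id INTEGER REFERENCES department(dept_id))""")

cur.execute("INSERT INTO department VALUES (10, 'Data Science')")
cur.executemany("INSERT INTO student VALUES (?, ?, ?)",
                [(1, "RAM", 10), (2, "RAMESH", 10)])

# A department rename now updates exactly one row, not one per student.
cur.execute("UPDATE department SET dept_name = 'DS' WHERE dept_id = 10")

rows = cur.execute("""SELECT s.name, d.dept_name
                      FROM student s JOIN department d
                        ON s.dept_id = d.dept_id
                      ORDER BY s.roll_no""").fetchall()
print(rows)  # [('RAM', 'DS'), ('RAMESH', 'DS')]
conn.close()
```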
UML The unified modeling language (UML) is a general-purpose visual modeling language that is intended to provide a standard way to visualize the design of a system.
Relational Database Languages DDL (Data Definition Language)
Create: Used to create objects in the database.
Alter: Used to change the structure of database objects.
Drop: Used to delete objects from the database.
Truncate: Used to remove all records from a table.
Rename: Used to rename an object in the database.
Comment: Used to add a comment to the data dictionary.
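The DDL commands can be sketched with Python's built-in sqlite3 module. Note that SQLite supports only a subset of standard DDL: there is no TRUNCATE or COMMENT statement, renaming is done through ALTER TABLE, and an unqualified DELETE plays TRUNCATE's role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: build a new table.
cur.execute("CREATE TABLE marks (roll_no INTEGER, score INTEGER)")

# ALTER: change the table's structure (here, add a column).
cur.execute("ALTER TABLE marks ADD COLUMN subject TEXT")

# RENAME: in SQLite, renaming is done through ALTER TABLE.
cur.execute("ALTER TABLE marks RENAME TO results")

# TRUNCATE: SQLite has no TRUNCATE keyword; an unqualified DELETE
# removes all records from the table instead.
cur.execute("DELETE FROM results")

# DROP: delete the object from the database.
cur.execute("DROP TABLE results")

# The table no longer appears in the system catalog.
n = cur.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE name = 'results'"
).fetchone()[0]
print(n)  # 0
conn.close()
```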
DML (Data Manipulation Language)
Select: Used to retrieve data from a table based on the requirements.
Insert: Used to insert data into an existing table.
Update: Used to update data in a table based on the requirements.
Delete: Used to delete data from a table.
Merge: Used for upsert (update-or-insert) operations.
Call: Used to call an SQL or Java subprogram.
Lock Table: Used to control concurrency by locking a table.
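The four core DML statements can be sketched with sqlite3 (SQLite does not support MERGE, CALL, or LOCK TABLE, so those commands are omitted); the student data below is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# INSERT: add data to an existing table.
cur.executemany("INSERT INTO student VALUES (?, ?, ?)",
                [(1, "RAM", 18), (2, "RAMESH", 18), (3, "SUJIT", 20)])

# UPDATE: modify data that matches a condition.
cur.execute("UPDATE student SET age = 19 WHERE roll_no = 1")

# DELETE: remove data that matches a condition.
cur.execute("DELETE FROM student WHERE roll_no = 3")

# SELECT: retrieve data based on the requirements.
rows = cur.execute("SELECT name, age FROM student ORDER BY roll_no").fetchall()
print(rows)  # [('RAM', 19), ('RAMESH', 18)]
conn.close()
```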
DCL (Data Control Language)
Grant: Used to give a user access privileges to the database.
Revoke: Used to take back access or permissions from a user.
TCL (Transaction Control Language)
Commit: Used to save a transaction permanently in the database; vital in sectors such as banking.
Rollback: Used to restore the database to its state at the last commit; this command also plays an important role in banking.
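COMMIT and ROLLBACK can be demonstrated through sqlite3's commit() and rollback() methods; the account table below mimics the banking example from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
cur.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()  # COMMIT: the inserted row is now saved permanently.

# Start a change inside a new transaction, then undo it.
cur.execute("UPDATE account SET balance = 0 WHERE id = 1")
conn.rollback()  # ROLLBACK: restore the state as of the last commit.

balance = cur.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
print(balance)  # 100 (the update was rolled back)
conn.close()
```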
Database Administration Database administration (DBA) involves the management, maintenance, and optimization of databases to ensure they operate efficiently, securely, and reliably. A database administrator is responsible for various tasks, including installation, configuration, performance tuning, security, and data recovery. 1. Database Installation and Configuration Installation : Installing the database software (e.g., MySQL, Oracle, SQL Server) on the server. Configuration : Setting up databases based on the organization’s needs, including specifying database locations, memory allocation, storage configuration, and connection settings. 2. Backup and Recovery Backup Strategy : Implementing regular backups to ensure that data can be restored in case of failure. Disaster Recovery : Planning and executing strategies to recover databases after catastrophic events (e.g., hardware failure, data corruption).
3. Performance Tuning Query Optimization : Analyzing and optimizing SQL queries to reduce response times. Indexing : Implementing indexes to improve the speed of data retrieval. Resource Allocation : Allocating CPU, memory, and disk resources for optimal performance. Monitoring Tools : Using tools to monitor database performance, identify bottlenecks, and resolve issues. 4. Security Management User Access Control : Managing users and roles, and granting appropriate permissions to access data. Encryption : Implementing encryption for sensitive data both at rest and in transit. Audit Logging : Monitoring and logging database activities to detect unauthorized access or changes.
7. Database Patching and Upgrades Patch Management : Applying patches and updates to fix bugs, improve performance, or address security vulnerabilities. Version Upgrades : Planning and executing upgrades to newer versions of database software. 8. High Availability and Replication Clustering : Setting up database clusters to ensure that the database remains available even if one server fails. Replication : Configuring data replication between multiple servers or sites to enhance data redundancy and availability. Failover Solutions : Implementing failover mechanisms to switch to a backup database in case of a primary database failure.
9. Capacity Planning Storage Management : Monitoring and managing disk space, and planning for future growth. Performance Forecasting : Analyzing trends in database usage to predict and plan for increased workloads. 10. Automation and Scripting Scripting Tasks : Automating routine tasks like backups, monitoring, and maintenance using scripts. Job Scheduling : Scheduling regular tasks (e.g., backups, index rebuilding) to run automatically.
11. Compliance and Auditing Regulatory Compliance : Ensuring that the database adheres to regulatory requirements like GDPR, HIPAA, or SOX. Auditing : Implementing audit trails to track who accessed or modified the data. 12. Monitoring and Alerts Real-time Monitoring : Monitoring the health, performance, and security of the database in real-time. Alerts : Setting up alerts for abnormal behavior, like long-running queries or low disk space, and taking proactive measures.
13. Database Documentation Documentation : Keeping detailed documentation of the database setup, architecture, and configurations. Change Management : Documenting any changes made to the database, including schema updates, patch applications, and tuning. 14. Troubleshooting Error Handling : Investigating and resolving database errors, performance issues, and crashes. Log Analysis : Analyzing database logs to find root causes of issues.
File Structure and Indexing 1. Heap Files (Unordered Files) Description : Data is stored in random order as it is inserted. There is no particular order in how the records are arranged. Advantages : Simple to implement, fast for insertions. Disadvantages : Slow for searching as the entire file must be scanned for a query. Use Case : Best suited for small datasets or when the primary operation is insertion.
2. Sequential Files (Ordered Files) Description : Records are stored in a sorted order based on a particular attribute (e.g., primary key). Advantages : Efficient for range queries and sequential access. Disadvantages : Insertion, deletion, and updates are expensive because they disrupt the order. Use Case : Useful for scenarios where queries often involve retrieving records in a particular order.
3. Hashed Files Description : A hash function is applied to an attribute (e.g., primary key), and the result determines where the record is stored on disk. Advantages : Efficient for exact match queries (e.g., looking up a record by its primary key). Disadvantages : Inefficient for range queries, as records are distributed based on the hash function. Use Case : Common for key-value databases and systems with frequent exact-match lookups.
4. Clustered Files Description : Related records from multiple tables are physically grouped together on disk based on a common attribute. Advantages : Reduces disk I/O for queries that retrieve related records from multiple tables. Disadvantages : Complex to manage and can result in fragmented data. Use Case : Ideal when there are frequent joins between tables.
An index is a data structure that allows the DBMS to locate records quickly without scanning the entire file. Indexes are similar to the index in a book, where you can look up a keyword and find the relevant pages quickly. Primary Index Secondary Index Clustered Index Non-Clustered Index Unique Index Composite Index (Multicolumn Index) Full-Text Index
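The effect of an index can be observed with SQLite's EXPLAIN QUERY PLAN, which reports whether a query scans the whole table or searches an index (the exact plan wording varies across SQLite versions; the table below is invented).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO student VALUES (?, ?)",
                [(i, f"student{i}") for i in range(1000)])

query = "SELECT * FROM student WHERE name = 'student500'"

# Without an index on name, the plan reports a full-table SCAN.
plan = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan[0][-1])  # e.g. "SCAN student"

# A secondary (non-clustered) index on name lets the DBMS jump
# straight to the matching rows instead of scanning the file.
cur.execute("CREATE INDEX idx_student_name ON student(name)")
plan = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan[0][-1])  # e.g. "SEARCH student USING INDEX idx_student_name (name=?)"
conn.close()
```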
Client-Server Database Server: Hosts the database. Manages data storage, data processing, and business logic. Handles requests from multiple clients. Examples: MySQL, PostgreSQL, SQL Server, Oracle Database. Client: Accesses the server to perform operations like querying data, inserting, updating, or deleting records. Typically, clients are applications or user interfaces that allow users to interact with the database. Examples: Web applications, desktop applications, mobile apps.
How It Works Request-Response Model: The client sends a request (usually an SQL query) to the server. The server processes this request, executes the query, and sends the results back to the client. This communication typically occurs over a network
Benefits of Client-Server Database Architecture Centralized Management: Data is managed and maintained on the server, making it easier to enforce security, backups, and updates. Scalability: The server can handle multiple clients simultaneously, and the system can be scaled by upgrading the server or adding more clients. Security: Access control and other security measures are easier to implement on the server-side.
Client-Server Database Examples Web Applications: A web app like an e-commerce site where the client (browser) sends requests to a server, which queries a database for product information. Enterprise Applications: A company’s internal software where employees (clients) access a central database on a server for tasks like inventory management or customer relationship management (CRM).
Data Warehousing Definition: A data warehouse is a large-scale, centralized repository of structured and, sometimes, semi-structured data from different sources, optimized for query and analysis rather than transactional processing. Purpose: The primary goal of a data warehouse is to provide a unified, consistent source of truth for business intelligence (BI) activities like reporting, data analysis, and data mining.
Components of a Data Warehouse Data Sources: These are various operational databases, CRM systems, ERP systems, web logs, etc., from which data is extracted. ETL (Extract, Transform, Load): A critical process that involves: Extract: Pulling data from source systems. Transform: Cleaning, formatting, and converting data into a usable format. Load: Storing the transformed data into the data warehouse. Data Storage: The central data repository where the processed data is stored. It is typically organized in a structured format, such as tables and schemas. Data Marts: Subsets of the data warehouse tailored for specific departments or functions, like sales, finance, or marketing.
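The extract, transform, load steps above can be sketched in Python, using an in-memory SQLite database to stand in for the warehouse; the source rows and the sales table are invented for illustration.

```python
import sqlite3

# Extract: pull raw rows from a "source system" (hard-coded here in
# place of a real operational database or log file).
source_rows = [
    {"name": " ram ",  "amount": "1,500"},
    {"name": "RAMESH", "amount": "2,250"},
]

# Transform: clean, format, and convert the data into a usable form
# (strip whitespace, normalize case, parse the amounts as integers).
cleaned = [(r["name"].strip().upper(), int(r["amount"].replace(",", "")))
           for r in source_rows]

# Load: store the transformed data into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)

# The warehouse is now queryable for analysis.
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 3750
warehouse.close()
```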
Architecture Single-Tier Architecture: Simplifies the design by combining ETL, storage, and analysis layers, but is less common due to performance limitations. Two-Tier Architecture: Separates the data warehouse storage from the analysis and reporting tools but often faces scalability issues. Three-Tier Architecture (most common): Bottom Tier: The database server (data storage). Middle Tier: The OLAP (Online Analytical Processing) server or data access layer that processes data queries. Top Tier: The front-end client tools used for analysis, querying, and reporting.
Benefits of Data Warehousing Improved Data Quality and Consistency: Centralizing data in a single repository improves data quality and standardizes formats. Faster Decision-Making: By having all relevant data in one place, organizations can make data-driven decisions quickly. Historical Data Analysis: Warehouses store large volumes of historical data, allowing for trend analysis and predictive analytics. Scalability: Designed to handle large amounts of data from multiple sources.
Types of Data Warehouses Enterprise Data Warehouse (EDW): Centralized and integrated warehouse that provides decision support for the entire organization. Operational Data Store (ODS): A staging area that consolidates data from different sources for operational reporting. Data Mart: Smaller, more focused warehouse for specific departments or business functions.
Cloud Data Warehousing Modern data warehousing solutions are increasingly moving to the cloud (e.g., Amazon Redshift, Google BigQuery, Snowflake), offering advantages like scalability, cost efficiency, and easier maintenance.