Databricks fundamentals for fresh graduates


Slide Content

Databricks

What is Databricks?
Databricks is a unified data analytics platform, built on top of Apache Spark, that provides an integrated workspace for data engineers, data scientists, and business analysts. It offers a range of tools for data processing, analytics, and machine learning, enabling organizations to streamline their data workflows and make data-driven decisions. Databricks was founded by the creators of Apache Spark in 2013.

Data Lakes vs Data Warehouses
A data lakehouse is a modern data management architecture that combines elements of both data lakes and data warehouses to provide a more unified platform for handling various types of data and supporting different forms of analytics. Data lakes are designed to store vast amounts of raw data in its native format. They are highly scalable and typically used for storing unstructured data such as text, images, and videos, allowing for big data processing and machine learning. Data warehouses, on the other hand, are structured to hold processed, refined data that is ready for analysis. They support complex queries and are optimized for speed and efficiency in data retrieval, making them ideal for business intelligence and reporting. A comparison in code follows below.
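The difference is easiest to see in how the data is accessed. Below is a minimal sketch, assuming a Databricks notebook where `spark` is preconfigured; the S3 path and the `sales` table are hypothetical placeholders, not values from the slides.

```python
# Data-lake style: raw files kept in their native format, schema applied on read
raw_logs = spark.read.text("s3://example-bucket/raw/logs/")

# Warehouse/lakehouse style: a curated, structured table queried with SQL
revenue = spark.sql("SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region")
revenue.show()
```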

Delta Lake
Delta Lake is an open-source storage layer that brings reliability, security, and performance to data lakes. Developed by Databricks, it is designed to work with Apache Spark and provides ACID (Atomicity, Consistency, Isolation, Durability) transaction capabilities to big data workloads, a critical feature typically associated with traditional databases. The Delta table is the default data table format in Databricks and is a feature of the Delta Lake open-source framework. Delta tables are typically used for data lakes, where data is ingested via streaming or in large batches.
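As a rough illustration, the sketch below writes a small DataFrame as a Delta table and queries an earlier version back with time travel. It assumes a Databricks notebook where `spark` is preconfigured and Delta is the default table format; the table name and sample rows are made up.

```python
# Minimal Delta Lake sketch; table name and rows are illustrative.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Write as a Delta table (Delta is the default table format in Databricks)
df.write.format("delta").mode("overwrite").saveAsTable("users_demo")

# Reads see a consistent, ACID-compliant snapshot of the table
spark.read.table("users_demo").show()

# Time travel: query an earlier version of the table by version number
spark.sql("SELECT * FROM users_demo VERSION AS OF 0").show()
```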

Quality
Good data is the foundation of a lakehouse. All data professionals need clean, fresh, and reliable data.

[Platform diagram: data sources (cloud storage, data warehouses, Hadoop storage, IoT/streaming data, REST APIs) flow into the Databricks Workspace, a collaborative workspace for data engineers, data scientists, and business analysts built on the optimized Databricks Runtime engine (Databricks I/O, high concurrency, Apache Spark, multi-stage pipelines) with a job scheduler and notifications & logs for deploying production jobs and workflows; outputs feed machine learning models, BI tools, data exports, and data warehouses. Built on secure and trusted cloud; scales without limits.]

Key Components of Databricks
Databricks Workspace: The workspace is the central hub where users create and manage all their resources, including notebooks, clusters, jobs, and libraries. It provides a collaborative environment where teams can work together on projects, share insights, and maintain version control of their work.
Clusters: Clusters are groups of virtual machines that run Spark applications in a distributed fashion. Databricks manages cluster creation, configuration, and scaling, making it easier for users to focus on data analysis and processing without worrying about infrastructure management.
Notebooks: Databricks notebooks are interactive documents that combine code, visualizations, and narrative text. They support multiple languages, including Python, Scala, SQL, and R, allowing users to perform data analysis, visualization, and machine learning tasks within a single environment.
Jobs: Jobs in Databricks allow users to automate the execution of their notebooks or JAR files on a schedule. This feature is essential for operationalizing data workflows such as ETL processes, model training, or data updates. A small scheduling sketch follows below.
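For example, a notebook can be scheduled as a job through the Databricks Jobs REST API (version 2.1). The sketch below is illustrative only: the workspace URL, access token, notebook path, runtime version, and cluster settings are all placeholders.

```python
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder token

job_spec = {
    "name": "nightly-etl",                               # illustrative job name
    "tasks": [{
        "task_key": "run_etl_notebook",
        "notebook_task": {"notebook_path": "/Users/me@example.com/etl"},  # placeholder path
        "new_cluster": {
            "spark_version": "14.3.x-scala2.12",         # example runtime version
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",         # run daily at 02:00
        "timezone_id": "UTC",
    },
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
print(resp.json())  # returns {"job_id": ...} on success
```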

Key Components of Databricks (continued)
Delta Lake: Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, and Durability) transactions to Apache Spark. It provides features like time travel (data versioning), schema enforcement, and scalable metadata handling, ensuring data reliability and consistency.
MLflow: MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It includes components for tracking experiments, packaging code into reproducible runs, and deploying models. MLflow is deeply integrated into Databricks, enabling seamless collaboration between data scientists and engineers.
Security and Governance: Databricks offers robust security features, including role-based access control, data encryption, and compliance with various regulatory standards. It also provides tools for auditing and monitoring to support data governance and security best practices.
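A minimal MLflow tracking sketch is shown below; in a Databricks notebook the tracking server is preconfigured, so the run appears in the workspace Experiments UI. The scikit-learn model, dataset, and logged values are illustrative, not taken from the slides.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run(run_name="iris-logreg"):
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)                       # track a hyperparameter
    mlflow.log_metric("train_accuracy", model.score(X, y))  # track a metric
    mlflow.sklearn.log_model(model, "model")                # package the model for later deployment
```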

The Architecture of Databricks
Data Plane: The data plane includes the clusters where data processing happens. Databricks ensures that the data plane is isolated and secure, with data encryption in transit and at rest.
Control Plane: The control plane manages the workspace, clusters, and job scheduling. It handles user authentication and authorization and provides a web-based interface for managing the Databricks environment.
Integration Layer: Databricks integrates seamlessly with a wide range of data sources, including cloud storage (such as Amazon S3 and Azure Blob Storage), databases, data lakes, and other analytics tools. This integration layer allows for easy ingestion and export of data, facilitating end-to-end data workflows, as the sketch below illustrates.
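To make the integration layer concrete, the sketch below reads raw data from cloud object storage into clusters in the data plane and writes a curated Delta table back out for downstream BI tools. It assumes a Databricks notebook with `spark` preconfigured and storage credentials already set up; all bucket, container, and table names are placeholders.

```python
# Ingest raw data from cloud object storage (AWS S3 and Azure ADLS examples)
events = spark.read.json("s3://example-bucket/raw/events/")
clicks = spark.read.parquet("abfss://raw@exampleaccount.dfs.core.windows.net/clicks/")

# Export a curated result as a Delta table that BI tools can query
events.write.format("delta").mode("append").saveAsTable("analytics.events")
```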