Course Overview, Data Science Lifecycle and Its Applications
Added: Sep 05, 2024
Slides: 26 pages
Slide Content
Course Overview: An overview of data science, CS 577, and the data science lifecycle. Josh Hug and Lisa Yan.
What is Data Science? Outline: Intros; What is data science?; What will you learn in this class?; Course overview (lots of important details); Data Science Lifecycle Demo.
Why I Care About Data Science
Why Data Science? The world is complicated, and data is a tool for finding truth in it. Many domains raise questions that need answers. Data science uses a combination of methods and principles from statistics and computer science to work with and draw insights from data.
Data-Centric Problems: Assess whether a vaccine works. Filter out fake news automatically. Calibrate air quality sensors. Advise analysts on policy changes.
Primary Goal of This Course: Be able to take data and produce useful insights on the world’s most challenging and ambiguous problems.
What is Data Science? PRINCIPLES AND TECHNIQUES OF DATA SCIENCE
Data is changing the world. From Joey Gonzalez.
Data science is a fundamentally interdisciplinary field (Joey Gonzalez). Data science is the application of data-centric, computational, and inferential thinking to understand the world (science) and solve problems (engineering).
Data Science Venn Diagram by Drew Conway, 2010
Insight: Good data analysis is not simple application of a statistics recipe, nor simple application of statistical software. There are many tools out there for data science, but they are merely tools. They don’t do any of the important thinking! “The purpose of computing is insight, not numbers.” - R. Hamming, Numerical Methods for Scientists and Engineers (1962).
Example Questions in Data Science: Some (broad) questions we might try to answer with data science: What show should we recommend to our user to watch? In which markets should we focus our advertising campaign? What areas of the world are at higher risk of climate change impact in 10 years? In 20? What should we eat to avoid dying early of heart disease? Do immigrants from poor countries have a positive or negative impact on the economy? Is the world getting better or worse?
What will you learn in this class? Outline: Intros; What is data science?; What will you learn in this class?; Course overview (lots of important details); Data Science Lifecycle Demo.
Tentative List of Topics to Be Covered in CS 577: Pandas and NumPy; Relational Databases & SQL; Exploratory Data Analysis; Regular Expressions; Visualization (matplotlib, seaborn, plotly); Sampling; Probability and random variables; Model design and loss formulation; Linear Regression; Feature Engineering; Regularization, Bias-Variance Tradeoff, Cross-Validation; Gradient Descent; Logistic Regression; Decision Trees and Random Forests.
Course Websites / Platforms
Online platforms: Course website on Canvas, where all lectures, assignments, and discussions are posted. Textbook ( www.textbook.ds100.org ): supplemental reading.
Programming Environment
Jupyter Notebook: “Jupyter notebooks are documents that combine live runnable code with narrative text (Markdown), equations (LaTeX), images, interactive visualizations and other rich output.” Installing Jupyter: https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html
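To make the "code plus narrative text" idea concrete, here is a minimal sketch of what a notebook document looks like on disk. It assumes the version 4 notebook JSON layout (a list of cells, each tagged as markdown or code); the helper name make_notebook and the sample cell contents are hypothetical, chosen only for illustration.

```python
import json


def make_notebook(markdown_text: str, code_text: str) -> dict:
    """Build a two-cell notebook as a plain dict (hypothetical helper).

    Assumption: this follows the nbformat 4.4 JSON layout, where each
    cell carries a cell_type, a metadata dict, and its source text.
    """
    return {
        "cells": [
            {
                # Narrative cell: Markdown text, may contain LaTeX.
                "cell_type": "markdown",
                "metadata": {},
                "source": markdown_text,
            },
            {
                # Code cell: not yet executed, so no outputs or count.
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": code_text,
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


notebook = make_notebook(
    "# A tiny example\nA line has the form $y = mx + b$.",
    "m, b = 2, 1\nprint(m * 3 + b)",
)

# Serialize to the JSON text that a .ipynb file would contain.
notebook_json = json.dumps(notebook, indent=1)
```

Writing notebook_json to a file with the .ipynb extension should give a document that JupyterLab can open, though in practice notebooks are created from the JupyterLab interface rather than by hand.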
JupyterLab: JupyterLab offers notebooks and more tools for data science. Options: use JupyterLab locally on your own machine, or use Google Colab.
Learning Advanced JupyterLab: Resources for learning fancier JupyterLab functionality: The quickest intro is this great 2-minute overview by Serena Bonaretti. Note: Unlike Serena’s example, in our course we’re using JupyterLab notebooks hosted on the internet, not on your own local computer. The interface overview from the official docs has more details and short, embedded videos. A more detailed discussion from a bio/data angle: ~45-minute video. A full ~3h in-depth tutorial is available from the core team.
Google Colab: What is Colab? Colab, or "Colaboratory", allows you to write and execute Python in your browser, with zero configuration required, access to GPUs free of charge, and easy sharing.
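Since the course allows both a local JupyterLab and Colab, a notebook sometimes needs to know which one it is running in. The sketch below uses a common heuristic, which is an assumption here and not something the slides prescribe: the Colab runtime loads the google.colab package, so its presence in sys.modules suggests Colab. The function names are hypothetical.

```python
import sys


def running_in_colab() -> bool:
    """Heuristic check: True if the google.colab package is loaded.

    Assumption: Colab's runtime imports google.colab at startup, while
    a local JupyterLab session normally does not.
    """
    return "google.colab" in sys.modules


def describe_environment() -> str:
    """Return a short label for the detected environment."""
    if running_in_colab():
        return "Google Colab"
    return "local or other Python environment"


print("Running in:", describe_environment())
```

A check like this can guard Colab-only steps, such as mounting cloud storage, so the same notebook still runs on a local machine.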
Course Logistics: Content and workflow
Weekly Flow: Class days: TTR. Class time, Section 1: 12:30 pm to 1:45 pm. Class time, Section 2: 2:00 pm to 3:15 pm. Class location: LH 347.
Discussion Section: There is a discussion board in Canvas, with two types of topics: topics covered in lecture and topics covered in assignments.
Homework: 4 assignments in Jupyter Notebook that must be individually submitted. Midterm exam: Oct. 19. Final exam: Dec. 12. A group term project: by Dec. 10. Format: Current plan: Primarily in-person exams with the option for virtual exams. Details TBD. Alternate exam times will be provided for all exams for pre-approved reasons, such as a concurrent final exam. If you miss an exam due to a personal emergency or illness, please contact me.
Grading Logistics: Grades will be posted on Canvas. Deadlines are firm at 11:59 PM. Extensions are provided only to students with DSP accommodations or, in exceptional circumstances, only if you email me before the deadline. You can submit assignments up to 2 days late, at 10% off per day. Lateness is rounded up to the next day: 2 minutes late = 1 day late. Grade breakdown: Mid-term exam 25%; Final exam 25%; Assignments 30%; Discussions 5%; Semester project 15%.
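The late policy and grade weights can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the official grading script: it assumes "10% off per day" means the earned score is multiplied by (1 - 0.10 × late days), and that a submission more than 2 days late earns zero. All function and variable names are hypothetical.

```python
import math

# Weights from the grading slide; they sum to 100%.
WEIGHTS = {
    "midterm": 0.25,
    "final": 0.25,
    "assignments": 0.30,
    "discussions": 0.05,
    "project": 0.15,
}

MINUTES_PER_DAY = 24 * 60


def late_days(minutes_late: float) -> int:
    """Round lateness up to whole days, so 2 minutes late is 1 day late."""
    if minutes_late <= 0:
        return 0
    return math.ceil(minutes_late / MINUTES_PER_DAY)


def late_adjusted_score(raw_score: float, minutes_late: float) -> float:
    """Apply the late penalty to one assignment score.

    Assumptions: the penalty is 10% of the earned score per late day,
    and anything more than 2 days late receives 0.
    """
    days = late_days(minutes_late)
    if days > 2:
        return 0.0
    return raw_score * (1 - 0.10 * days)


def course_grade(scores: dict) -> float:
    """Weighted average of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

For example, under these assumptions late_days(2) returns 1, so an assignment scored 90 and submitted 2 minutes after the deadline would be recorded as roughly 81.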