Overview of what big data is, how it differs from other data, what data scientists are and do, and what benefits data science can bring.
Added: Sep 29, 2017
Slides: 14 pages
Big Data and Data Science Overview
Colleen M. Farrelly
Part 1: Big Data
What is “Big Data?”
- Oxford English Dictionary: “An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications”
- Defined by volume, variety, velocity
- 2008 computer scientist predictions: Big Data will “transform the activities of companies, scientific researchers, medical practitioners, and our nation’s defense and intelligence operations”
- According to the New York Times: Big data science “typically means applying the tools of artificial intelligence, like machine learning, to vast new troves of data beyond that captured in standard databases”
What does BIG DATA look like?
- Wider (more variables per observation)
- Longer (more observations)
- Wider and Longer
- Complex subgroupings within wider or longer sets
- Many correlations
- Noisy
- Missing data
Why Big Data is Different
- Computational challenges of storage and statistical program memory
  - R holds data in memory, so on a laptop the usable space may be limited to about 2 GB unless more RAM is added.
  - Algorithm computing time grows according to scaling rules, many of which are superlinear (polynomial or even exponential). For example, under quadratic scaling, if 2 GB takes 4 minutes, then 4 GB takes 16 minutes…
- Statistical challenges from data structure
  - Wide data violates many statistical assumptions.
  - Correlations among predictors also violate statistical assumptions and create problems for the underlying linear algebra calculations.
  - Potential for lots of informative missing data that can’t be imputed using existing statistical methods.
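Doubling the data from 2 GB to 4 GB while the runtime quadruples from 4 to 16 minutes corresponds to quadratic growth. The sketch below makes that arithmetic explicit. It assumes, purely for illustration, that compute time follows a power law in data size (time proportional to size raised to an exponent); the function name `projected_minutes` and the exponent values are hypothetical and not taken from the slides.

```python
def projected_minutes(base_gb, base_minutes, target_gb, exponent):
    """Project runtime, assuming time scales as size ** exponent (an assumption)."""
    return base_minutes * (target_gb / base_gb) ** exponent


if __name__ == "__main__":
    # Baseline from the slide: 2 GB takes 4 minutes.
    for label, k in [("linear", 1), ("quadratic", 2), ("cubic", 3)]:
        minutes = projected_minutes(2, 4, 4, k)
        print(f"{label:9s} scaling: 4 GB takes about {minutes:.0f} minutes")
```

Only the quadratic case reproduces the 16-minute figure; real algorithms may scale better or worse than this simple model suggests.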
Solutions to Big Data Challenges
- More computing resources
  - Expensive
  - Cloud computing
  - Does not solve statistical issues posed by big data
- New statistical methods
  - Rely on a new set of tools from computer science
  - Work around limitations of existing multivariate data analysis methods
  - Don’t always scale as big data grows
  - Still have computational issues
  - Need for larger and larger training sets for good performance
Overview of Computing Tools
- Hadoop
  - Open-source software for storage and processing of big data across computer cores/clusters
  - Compatible with existing statistical software
- MapReduce
  - Distributed computing strategy for big data processing and analyses
  - Compute problem in parallel and combine final answers for shorter compute times
- SQL/NoSQL
  - Query languages and tools for relational (SQL) and non-relational (NoSQL) databases, used for:
    - Database construction/modifications
    - Pulling pieces of data for further analyses/reporting
- R
  - Free open-source software with existing machine learning algorithms and a coding environment to create and test new machine learning algorithms
- Simulations
  - Use data structure and relationship rules to create a dataset with a pre-specified structure
  - Allows for testing and validation of new algorithms against datasets with known answers
  - Useful for comparing existing algorithms with new algorithms
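The MapReduce idea of computing partial answers in parallel and then combining them can be illustrated with a toy word count. This is a minimal single-machine sketch in plain Python, not Hadoop's API: the helper names (`map_chunk`, `reduce_counts`, `word_count`) and the sample sentences are made up for illustration, and the map step runs sequentially here where a real cluster would run it on separate nodes.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count words within one chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(lines, n_chunks=2):
    # Split the data into chunks; on a cluster each chunk would sit on a different node.
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    partials = [map_chunk(chunk) for chunk in chunks]  # this step could run in parallel
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    data = ["big data is big", "data science uses big data"]
    print(word_count(data))
```

Because each chunk is counted independently, the map step needs no communication between workers; only the small partial counts are combined at the end.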
Overview of Mathematical Tools
- Statistics
  - Hypothesis testing (parametric and nonparametric) and experimental design
  - Generalized linear models
  - Longitudinal, time series, and survival models
  - Bayesian methods
- Mathematics
  - Multivariable calculus
  - Linear algebra
  - Probability theory
  - Optimization
  - Graph theory/discrete math
  - Real analysis/topology
- Machine learning
  - Technically considered a branch of statistics
  - Supervised, unsupervised, and semi-supervised models
  - Serve to extend statistical models and relax assumptions on data
  - Includes algorithms from topological data analysis and network analysis
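As one concrete instance of nonparametric hypothesis testing, a permutation test compares two groups without assuming a particular distribution. The sketch below is an illustrative example rather than a method named in the slides; the function name `permutation_test`, its defaults, and the sample values are all hypothetical.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up measurements for two groups.
    control = [1, 2, 3, 4, 5, 6]
    treated = [11, 12, 13, 14, 15, 16]
    print(f"estimated p-value: {permutation_test(control, treated):.4f}")
```

The p-value is the share of label shufflings that produce a difference at least as large as the observed one, so the only assumption is that the labels are exchangeable when there is no true difference.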
Part 2: Data Scientists
What is a “Data Scientist?”
- A professional who blends several different areas of expertise to draw insights from disparate data sources (particularly big data) such that inference can be made about specific problems/decisions within the field of application
- Data science is a blend of statistical, machine learning, computer science, mathematical, and domain knowledge to leverage data for decision-making in that domain (business, medical, social media…).
What is their process?
- Discuss problem with leadership to understand the problem and how results might be used.
  - Providing a predictive algorithm that performs well but doesn’t provide insight into the problem might not be useful.
  - There may be related items that leadership hasn’t considered, items that can enrich the project.
- Define data that needs to be pulled.
  - May exist in database.
  - May need to find elsewhere.
- Pull and clean data.
  - Examine for errors or bias.
  - Deal with missing data.
- Perform analyses and interpret output.
  - Can be supervised (fit to outcome) or unsupervised (exploratory).
  - Typically involves visualization of important results.
- Compile summary of actionable insights for leadership.
  - Simplification
  - Business value (no point in doing analysis if it can’t be implemented!)
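The "pull and clean data" step might look something like the following minimal sketch, which uses Python's built-in sqlite3 module with an in-memory database. The table name `customers`, its columns, the plausibility range for age, and the sample rows are all hypothetical, chosen only to show pulling data with SQL and then flagging missing or implausible values.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with made-up customer records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, age INTEGER, spend REAL)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, 34, 120.5), (2, None, 80.0), (3, 250, 15.0), (4, 51, None), (5, 28, 42.0)],
    )
    return conn


def pull_and_clean(conn):
    """Pull rows with SQL, then separate usable rows from flagged ones."""
    rows = conn.execute("SELECT id, age, spend FROM customers ORDER BY id").fetchall()
    clean, flagged = [], []
    for row_id, age, spend in rows:
        if age is None or spend is None:
            flagged.append((row_id, "missing value"))
        elif not 0 <= age <= 120:  # assumed plausibility range for age
            flagged.append((row_id, "implausible age"))
        else:
            clean.append((row_id, age, spend))
    return clean, flagged


if __name__ == "__main__":
    clean_rows, flagged_rows = pull_and_clean(build_demo_db())
    print("clean:", clean_rows)
    print("flagged:", flagged_rows)
```

Flagging rows rather than silently dropping them keeps a record of how much data is missing, which matters when the missingness itself may be informative.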
What skills do they typically have?
- Mathematical/Statistical Background
  - Graduate degree, typically in mathematics/statistics, computer science, or engineering
  - Training in machine learning and algorithm design
  - Experience with R and SAS statistical languages/programs
- Computer Science Background
  - Python/MATLAB/other high-level computing languages
  - Hadoop/MapReduce concepts
  - SQL or NoSQL coding for database extraction/management
  - Experience with structured or unstructured data
  - Data mining/algorithm design
- Field of Application Expertise
  - Intellectual curiosity
  - Understanding of the industry of application (marketing, medical, finance…)
  - Communication skills to relate findings to non-technical leaders
What industries use them?
- From a quick Indeed.com search:
  - Allstate Insurance
  - Sprint
  - Twitter
  - APS Healthcare
  - XOR Security
  - LinkedIn
  - IBM
  - Intel
  - Roche Pharmaceuticals
  - Amazon
  - Capital One
Are they providing value?
- According to NewVantage and others:
  - 2016 revenue gained from data science is estimated at $130.1 billion.
  - This is expected to grow to $203 billion by 2020.
- Individual company results vary according to:
  - Team talent and expertise
  - Data collected (and quality of data)
  - Competitor strengths in data science
- Current and projected shortages of those with analytics talent will impact the market.
- Hubs of data science are emerging outside California: Boston, New York, Austin, Chicago, Jacksonville, Tampa, Charlotte, Atlanta…
  - Across industries: healthcare, tech, finance, energy…
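The two revenue figures ($130.1 billion in 2016 and $203 billion by 2020) imply an average yearly growth rate, which a short calculation can make explicit. This sketch assumes steady compound growth between the two endpoints, which the slides do not state; the function name `implied_annual_growth` is hypothetical.

```python
def implied_annual_growth(start_value, end_value, years):
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1


if __name__ == "__main__":
    # Figures from the slide, in billions of dollars.
    rate = implied_annual_growth(130.1, 203.0, 2020 - 2016)
    print(f"Implied compound annual growth: {rate:.1%}")
```

Under that assumption the projection works out to just under 12 percent growth per year.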