Hit 2203 - Big Data & Data Analytics
Facilitators: L. Amos, S. Chaputsira, T. Butsa
Data analytics life cycle
Apache Spark: processing framework for big data
Spark supports various programming languages
Spark features
Components of Apache Spark
Data analytics life cycle
The data analytics life cycle is a cyclic process that describes, in six stages, how information is created, collected, processed, implemented, and analyzed for different objectives.
Discovery phase
This phase centers on understanding the problem statement and studying the business model thoroughly. It involves: understanding the business problem; asking questions; meeting with all the stakeholders; understanding what kind of data is available; and checking whether a similar problem has been solved before.
Data preparation
Also known as data munging or data manipulation, this is the most important task in the data life cycle if valuable insights are to emerge. Raw data on its own is meaningless, so the data scientist explores it by looking at a sample of a few records to discover whether there are gaps in the data, whether its structure is appropriate to feed into the system, and whether any columns add no value. Such columns may not be required for analysis; e.g. a customer-name column may add no value from an analysis perspective. Where there are gaps in the data, they need to be filled with something meaningful.
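The preparation steps above (sample a few records, look for gaps, drop columns that add no value, fill the gaps) can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the column names (`name`, `age`, `monthly_spend`) and the choice of the column mean as the "something meaningful" fill value are all hypothetical, not part of the course material.

```python
from statistics import mean

# Hypothetical raw records; None marks a gap in the data.
raw_records = [
    {"name": "Alice", "age": 34, "monthly_spend": 120.0},
    {"name": "Bob", "age": None, "monthly_spend": 80.0},
    {"name": "Carol", "age": 46, "monthly_spend": None},
    {"name": "Dan", "age": 28, "monthly_spend": 100.0},
]


def count_gaps(records):
    """Return the number of missing (None) values per column."""
    gaps = {}
    for row in records:
        for column, value in row.items():
            gaps[column] = gaps.get(column, 0) + (value is None)
    return gaps


def drop_columns(records, columns):
    """Return copies of the records without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in records]


def fill_gaps_with_mean(records, column):
    """Replace None in one numeric column with the mean of the observed values."""
    observed = [row[column] for row in records if row[column] is not None]
    fill_value = mean(observed)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in records
    ]


# 1. Take a few records to look at the structure of the data.
sample = raw_records[:2]

# 2. Find out where the gaps are.
gaps = count_gaps(raw_records)

# 3. Drop a column that adds no value for analysis (assumed: the customer name).
prepared = drop_columns(raw_records, {"name"})

# 4. Fill the remaining gaps (assumed strategy: column mean).
for column in ("age", "monthly_spend"):
    prepared = fill_gaps_with_mean(prepared, column)
```

The mean is just one possible fill value; depending on the data, a median, a default category, or removing the incomplete record may be more appropriate.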
Model planning
This step involves exploratory data analysis (EDA) to understand the relationships between variables and to see what the data can tell us. Key variables are selected.
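One simple way to explore the relationship between variables during EDA is to compute the Pearson correlation of each candidate variable with the target and keep the strongly related ones. The sketch below assumes hypothetical variables (`ad_spend`, `day_of_month`, `sales`) and an arbitrary cut-off of 0.5; neither comes from the course material, and correlation is only one of many EDA tools.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical prepared data: two candidate variables and one target.
candidates = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "day_of_month": [3.0, 1.0, 4.0, 1.0, 5.0],
}
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

# Correlation of every candidate variable with the target.
correlations = {name: pearson(values, sales) for name, values in candidates.items()}

# Rank candidates by the strength of the relationship (absolute correlation).
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

# Select key variables using an assumed threshold of 0.5.
THRESHOLD = 0.5
key_variables = [name for name, r in ranked if abs(r) >= THRESHOLD]
```

Correlation only captures linear relationships, so plots and domain knowledge would normally be used alongside it before variables are discarded.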
Various techniques can be used for model planning, including:
Model building
Using various analytical tools and techniques, data is transformed with the goal of discovering useful information to build the right model.
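As a minimal illustration of building a model from prepared data, the sketch below fits a straight line by ordinary least squares and uses it to predict a new value. The data (`ad_spend`, `sales`) and the choice of a linear model are assumptions made for the example; the slides do not prescribe a particular technique.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Apply the fitted line to a new input value."""
    return slope * x + intercept


# Hypothetical training data: one key variable and the target.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(ad_spend, sales)

# Use the fitted model on an unseen input.
forecast = predict(slope, intercept, 6.0)
```

In practice the model would be evaluated on data held out from training before its results are communicated to stakeholders.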
Communicating the results
Key findings are identified and communicated to the stakeholders.
Operationalize
Final reports, code, and technical documents are delivered by the team.