Data Lakes, Data Science, Big Data, Data Analytics? Over the last few years, marketing departments have bombarded us with these terms, constantly promising us the next big, new thing. Unfortunately, most of the messaging contains little substance, let alone anything that explains what exactly Data Science is. I am far from an expert Data Scientist, but I am certified as one, and it took me sweat and tears to understand the concept.
Data Science is a methodology, not a product. It uses statistics, just like the life sciences, physics or psychology, to make sense of (research) data. What is new is the technology to handle increasingly large volumes of data, structured (organized in database tables), unstructured (video, images, text documents) or a combination of both, and to process that data at staggeringly fast speeds. But let’s start with the six-step methodology, and delve into the technical aspects along the way.
Step 1, Discovery: You need to understand the domain that you are investigating, and scope out what you wish to analyze and deliver. This means framing the business problem as an analytical challenge.
Step 2, Data Preparation: You need to create a digital sandbox, extract relevant data from various sources, familiarize yourself with that data, and transform that data into usable formats. A digital sandbox, or data lake, is a collection of data from various sources, set up for analysis. This is by far the hardest step in the process, and it largely determines the project outcome: good, clean data produces reliable insights.
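To make the "transform that data into usable formats" part a bit more concrete, here is a minimal sketch in Python. The column names, the sample records and the rule of simply rejecting any row that fails to parse are all assumptions made up for illustration; a real project would decide per field whether to repair, impute or drop.

```python
import csv
import io

# Hypothetical raw extract; the column names and values are invented for illustration.
RAW_CSV = """customer_id,age,monthly_spend
1, 34 ,120.50
2,,80.00
3,forty,65.25
4,51,not available
5,29,42.10
"""

def prepare(raw_text):
    """Return the rows that parse cleanly, plus a count of rejected rows."""
    clean, rejected = [], 0
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            clean.append({
                "customer_id": int(row["customer_id"]),
                "age": int(row["age"].strip()),
                "monthly_spend": float(row["monthly_spend"].strip()),
            })
        except ValueError:
            # Missing or non-numeric value: this sketch rejects the whole row.
            rejected += 1
    return clean, rejected

clean_rows, rejected_count = prepare(RAW_CSV)
print(len(clean_rows), "usable rows,", rejected_count, "rejected")
```

Even in this toy extract more rows are rejected than kept, which hints at why this step tends to eat most of the effort.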
Step 3, Model Planning: You need to determine which (statistical) models can deliver the desired outcome and insight. A deep understanding of statistical models like Linear and Logistic Regression, Association Rules, Time Series Analysis, Decision Trees and Naïve Bayes Classifiers is a must. Open-source tools like ‘R’ can help determine which model(s) would be a good fit for your project.
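As an illustration of the simplest model in that list, the sketch below fits a straight line with ordinary least squares in plain Python. The data points are made up (they lie exactly on y = 2x + 1), and the function name fit_line is only an illustrative choice; in practice you would more likely reach for a statistics package than write this by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented sample data that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
```

Real data never fits this neatly, so part of model planning is checking how far the points stray from the fitted line before trusting it.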
Step 4, Model Building: This sees you through the actual implementation of the statistical model, usually in a MapReduce environment. MapReduce is a programming model that lets you process tasks in parallel, thereby enabling analysis of Big Data volumes. Classic statistical applications like SAS, SPSS and JMP allow in-memory processing of structured data only, which limits the data volume they can handle. Hadoop is the best-known platform that implements MapReduce for ‘analyzing’ the data, combining it with the Hadoop Distributed File System (HDFS) for ‘storing’ the big data volumes.
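The MapReduce idea itself fits in a few lines. The sketch below is a single-process imitation in plain Python, not Hadoop code: it counts words in two made-up documents by mapping each document to (word, 1) pairs, grouping the pairs by word, and reducing each group to a total. The function names are illustrative assumptions. On a real cluster the map and reduce calls would run in parallel on different machines, which is where the scalability comes from.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

# Invented input; each string stands in for a separate chunk of a large data set.
documents = ["big data big insights", "data science uses data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each map call only looks at its own document and each reduce call only at its own key, no step needs the full data set in memory at once.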
Step 5, Communicating Results: This assumes your model(s) discovered valuable insights or made accurate predictions (customer churn, mortgage default probability, category analysis, component failure rates, etc.). Then you need to translate the findings into recommendations and practical improvement proposals. Communicate the findings to the business owners, get their buy-in, and provide all relevant analysis and insights so they can start improving what was analyzed.
Step 6, Operationalize: This is all about getting the model up and running for future data sets, so that it can be reused for new situations or to compare the old ‘as is’ with the new ‘to be’.
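One simple way to picture this step: store the fitted model's parameters, then load them again later to score data that did not exist when the model was built. The sketch below does that with JSON in plain Python. The parameter values, the linear form of the model and the function names are all assumptions for illustration; a production setup would add versioning, monitoring and retraining on top.

```python
import json

# Hypothetical parameters of an already fitted linear model (values are made up).
model = {"slope": 2.0, "intercept": 1.0}

# Serialize the model; in practice this text would go to a file or a model store.
serialized = json.dumps(model)

def load_model(text):
    """Restore model parameters from their serialized form."""
    return json.loads(text)

def predict(params, x):
    """Score one new observation with the stored linear model."""
    return params["slope"] * x + params["intercept"]

# Later, possibly in another program: reload the model and score new data.
restored = load_model(serialized)
new_inputs = [6.0, 7.0]
predictions = [predict(restored, x) for x in new_inputs]
print(predictions)
```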
This six-step methodology can be devilishly difficult, and experienced Data Scientists, who combine statistical knowledge and data manipulation skills with the ability to translate outcomes into business insights, are as rare as dodos. Hopefully, technological improvements and rising demand will change that.