
Basic Ingredients of Data Science


Data science is a multidisciplinary field; it has more dimensions than we yet have the means to recognize or measure in its far-reaching explorations. Most of us are still becoming familiar with this relatively new phenomenon, which emphasizes:

  • Learning from data – what is known today as data analysis
  • Data preparation, not just statistical modeling
  • Prediction, not just inference

As recently as 2015, renowned universities such as UC Berkeley, NYU, MIT, and the University of Michigan formally launched data science programs, effectively pushing the boundaries of classical theoretical statistics beyond the conventional sciences. The process took some 50 years from the first calls to reform academic statistics, championed by well-known scholars, computer scientists, and statisticians such as John Tukey, John Chambers, Jeff Wu, Bill Cleveland, and Leo Breiman (Donoho, 2017).

Today, data science is securely positioned as a pivotal force assisting the technological transition all over the world. If you have been following our previous writings on the topic of data science in its many variations, you should have a pretty decent idea of:

  • What is data science?
  • How does it work?
  • Basic terms, techniques, and tools of data science
  • What is a data scientist?
  • Skills and eligibility to become a successful data scientist
  • Data science careers

With the same motivation of making the many complexities of this multifaceted field easy to understand, we now take up another angle, aiming to explain data science through its three basic ingredients:

  1. Datification
  2. Data Analysis
  3. Computation

So, let’s begin!

Datification

Datification is known as the heart of data science. No surprises there, given how the entire science of data revolves around manipulating data to extract information that is not readily seen. What is surprising, though, is that there is still no proper definition of datification to this date, only heavy use of the term alongside big data and machine learning. This leads people, on a first skim, to assume it is merely a buzzword. But that is not true; a serviceable description does exist, though it is usually put together in a counterproductive manner that misleads people.

Have you noticed that we are now past the debate about integrating technology into our daily lives? That is because the discussion has moved on to how efficiently that technology performs, and such performance relies on data for its improvements, required or not. This is exactly what makes datification so pertinent to the process, and hence why it is referred to as the heart of data science. How? Let's see.

Datification is the encapsulation of ideas, behavioral patterns, thoughts, and choices in data form. With the help of technology, datification turns social actions into computerized information, or quantified data, which surfaces varying trends, patterns, and behaviors and makes them usable. The process starts with data gathering and proceeds through organizing, filtering, and analyzing, finally transforming the data into visual formats that enable predictive analysis of various trends and patterns. Simply put, datification is a technological development that turns our behavioral activities, daily interactions, and many other aspects of our lives into the kind of computerized data that can be monitored, analyzed, and optimized (Blei and Smyth, 2017).

It is datification that is transforming traditional organizations into data-driven establishments by converting basic information into a valuable and legible data form.
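
To make the idea concrete, here is a minimal Python sketch of datification: hypothetical everyday events are captured as records, organized into a table, and quantified into a usable trend. All data and field names below are invented for illustration.

```python
import pandas as pd

# Hypothetical raw behavioral events: everyday actions captured as records.
events = [
    {"user": "u1", "action": "check_in", "place": "cafe",   "hour": 8},
    {"user": "u1", "action": "check_in", "place": "office", "hour": 9},
    {"user": "u2", "action": "check_in", "place": "gym",    "hour": 18},
    {"user": "u2", "action": "check_in", "place": "cafe",   "hour": 8},
]

# Organize the events into a structured, analyzable table.
df = pd.DataFrame(events)

# Quantify behavior: how often is each place visited at each hour?
trend = df.groupby(["place", "hour"]).size().rename("visits").reset_index()
print(trend)
```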

 

Examples of Datification

GPS Tracking

Global positioning system (GPS) technology is the prime example of datification in our daily lives. GPS, a network of about 30 satellites, tracks our locations and most visited places through our mobile phones and many other devices. This location information helps the relevant parties determine our commuting preferences and usual travel times. The vehicle traffic data that is monitored everywhere, at any time, is used by many applications and websites, such as Google Maps and other navigators, to improve our experience.

Through datification, our location and travel information is turned into structured data, which location companies then consume to guide us about traffic density and alternative routes.
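
As a small illustration, here is how raw GPS fixes can be turned into usable travel data. The coordinates below are hypothetical, and the haversine formula is just one standard way to compute the distance between two points on the globe.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two GPS fixes."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~= 6371 km

# Hypothetical fixes logged by a phone: (latitude, longitude).
home, office = (51.5074, -0.1278), (51.5155, -0.0922)
print(f"Commute leg: {haversine_km(*home, *office):.2f} km")
```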

 

Web Resources Network

Datification in the web resources network works through user information such as IP address, browser type and version, and sometimes even location. When you visit a website, you are often asked to allow or block cookies. A cookie is a small file that stores user information once you permit it to trace and record your data. Web servers and resource networks use this information to learn about their audience and improve both the reach of their content and the user experience.
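
For a feel of what a server actually receives, here is a minimal sketch that parses a cookie header with Python's standard library. The header value and cookie names are invented for illustration.

```python
from http.cookies import SimpleCookie

# Hypothetical Cookie header sent back by a returning visitor's browser.
raw_header = "session_id=abc123; pref_lang=en; last_visit=2021-03-01"

cookie = SimpleCookie()
cookie.load(raw_header)

# A server can read these values to recognize the user and tailor content.
for name, morsel in cookie.items():
    print(f"{name} = {morsel.value}")
```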

 

Social Media Platforms

Social media platforms are the biggest data collection pools. Social media apps like Facebook, Twitter, Instagram, etc., have many ways of urging you to give away personal or incidental information about yourself and the people you know. For instance, you reveal your likes and dislikes, your thought process, your manner of expression, your preferences, your location, and so on. All of it is converted into data that can be used to enhance your experience, as long as it stays in the right hands. The technique most often used to extract insight from social media content is natural language processing (NLP), a branch of artificial intelligence that analyzes human language.

Many data extraction tools built on NLP work by detecting sentiment, attitude, behavior, etc., through linguistic analysis. They analyze text and expression to predict reactions to a certain incident, product, or campaign.
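
As one hedged example, the sketch below scores the sentiment of two invented posts with NLTK's VADER analyzer, a common off-the-shelf NLP tool; the posts and the choice of library are ours, not a prescription.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-off lexicon download

analyzer = SentimentIntensityAnalyzer()

# Hypothetical social media posts about a product launch.
posts = [
    "Absolutely love the new update, great job!",
    "This release is buggy and frustrating.",
]
for post in posts:
    scores = analyzer.polarity_scores(post)
    print(f"{scores['compound']:+.3f}  {post}")  # compound > 0 leans positive
```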

 

Search Engines

Search engine optimization (SEO) is probably one of the most widely used forms of datification. Search engines track the searches we make through text, images, videos, and even voice notes. This data helps companies predict our choices and preferences so they can develop tailored, targeted ads for each of us individually. The datification of our web searches lets the relevant companies know what type of product or term is most searched in which region of the world, as well as which products are popular with which age groups, sexes, and nationalities.

This is exactly how the Google AdSense program works. It allows website publishers within the Google network to serve interactive media advertisements, in the form of text, images, or video, to a targeted audience. Our search information also helps companies develop customized data mining tools using open-source technology.
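
To illustrate the kind of aggregation involved, here is a minimal sketch that ranks hypothetical search terms by region and age group; the log data and column names are invented.

```python
import pandas as pd

# Hypothetical search logs: query terms with each searcher's region and age group.
searches = pd.DataFrame({
    "term":   ["sneakers", "sneakers", "laptop", "sneakers", "laptop"],
    "region": ["EU", "US", "US", "US", "EU"],
    "age":    ["18-24", "18-24", "25-34", "18-24", "25-34"],
})

# Which term is most searched within each region and age group?
popularity = (searches.groupby(["region", "age", "term"])
                      .size().rename("count").reset_index()
                      .sort_values("count", ascending=False))
print(popularity)
```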

 

Medical Science

Like many other industries, the field of medical science relies on datification for research and analysis. Recent advances in medicine, specifically regarding the impact of food and diet, have been achieved with the help of datification procedures applied to different cases. The effects of medicines on human anatomy and psychology are digitized with the help of technology. The scientific community uses this data to raise awareness and develop treatments for critical illnesses and conditions, and governmental bodies use it to devise health policies and programs that benefit their citizens by investing in targeted research.

Data Analysis

Data analysis is the very foundation of the whole science of data. It is the process of filtering, converting, and modeling raw data with specific techniques in order to extract useful information for decision-making. The process comprises standard steps and approaches, matched to the type of data under evaluation, and relies on analytical and logical reasoning. A lot of us tend to confuse data analysis with data analytics because of similarities that go beyond the names or tools.

Data analysis, in essence, refers to the compilation and analysis of data, while data analytics is essentially a subcomponent of it. Similarly, data analysts are not data scientists: analysts specialize in identifying trends by reading and interpreting data, which comes close to data scientists' responsibilities, minus the coding expertise that all data scientists are proficient in (Waller and Fawcett, 2013).

 

Data Analysis Process

The process of data analysis is based on certain steps that, upon effective implementation, turn raw, unstructured data into a readable form that allows the extraction of useful insights. These insights are then used to draw various patterns and ultimate conclusions.

The data analysis process consists of the following steps (a minimal sketch of the pipeline follows the list):

  • Data gathering
  • Data organization
  • Data cleaning
  • Data analysis
  • Data interpretation
  • Data visualization
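
Here is the promised sketch of those steps on a toy dataset: hypothetical sales records are gathered, cleaned, analyzed, and summarized, with interpretation and visualization building on the summary.

```python
import pandas as pd

# Gathering: hypothetical raw sales records, messy as real inputs tend to be.
raw = pd.DataFrame({
    "date":   ["2021-01-03", "2021-01-03", None, "2021-01-04"],
    "region": ["north", "south", "north", "south"],
    "sales":  [120.0, None, 95.0, 80.0],
})

# Organization and cleaning: drop incomplete records, fix the date type.
clean = raw.dropna().assign(date=lambda d: pd.to_datetime(d["date"]))

# Analysis: summarize sales per region.
summary = clean.groupby("region")["sales"].agg(["count", "sum", "mean"])

# Interpretation and visualization would build on this summary,
# e.g. summary["sum"].plot(kind="bar").
print(summary)
```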

 

Types of Data Analysis

Data analysis comes in many types, which vary with subject areas such as business, technology, security, and healthcare.

The main types of data analysis are as follows:

  • Text analysis – also referred to as data mining
  • Statistical analysis – also referred to as data modeling
  • Diagnostic analysis
  • Predictive analysis
  • Prescriptive analysis

 

Data Analysis Tools

The tools for data analysis are meant to make the whole process easier and more straightforward. Many tools and techniques are available to assist with data manipulation and with evaluating the relations and correlations within datasets. These tools also support pattern identification and trend interpretation.

Below is a list of the main data analysis tools, used frequently by experienced and inexperienced data enthusiasts alike:

  • Xplenty
  • Azure HDInsight
  • Skytree
  • Talend
  • Splice Machine
  • Spark
  • Apache SAMOA
  • Lumify
  • Elasticsearch
  • R-Programming
  • IBM SPSS Modeler

Computation

The study of computation is paramount to computer science, which, in turn, is vital to the discipline of data science. Generally, computation refers to any arithmetical or non-arithmetical calculation that follows a specific model, e.g., an algorithm, whereas computational science, or scientific computing, refers to high-performance computing (HPC). Given that data science is all about computational data analysis, as in bioinformatics, machine learning, big data, etc., computation here becomes more inclusive than either statistics or computer science (Iosup et al., 2011).

Computation in data science also extends to the performance, speed, and subject areas that make it possible to solve differential equations through linear algebra at a rapid pace. Strictly speaking, this may well be referred to as scientific computing or statistical machine learning, because almost all numeric scientific problems can be expressed in linear algebra. Data science requires computational methods to identify and analyze large datasets, extract insights, and draw readable patterns. Data scientists program core data science methods computationally and gauge the accuracy of their programs through the mathematics that underpins the methodology.
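
Since the paragraph above leans on linear algebra, here is a minimal NumPy sketch: a small linear system, of the kind a discretized differential equation reduces to, solved with an optimized library routine. The particular matrix and vector are invented.

```python
import numpy as np

# Many numeric scientific problems reduce to solving a linear system A x = b.
# Hypothetical 3x3 system standing in for a discretized equation.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])

x = np.linalg.solve(A, b)       # direct solve via optimized LAPACK routines
print(x)
print(np.allclose(A @ x, b))    # sanity check: should print True
```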

Phrases such as leading with computation, computational forensics, computational thinking, and computational reasoning have appeared frequently of late. They may seem daunting to someone who is just getting to grips with the basics of data science, owing to the technically dense jargon. All of these allusions refer to the process of determining the feasibility of a specific algorithm, or of developing data re-processing methods in case an algorithm fails. Frequent practice of these exercises is what allows the complex algorithms of data science to be computationally reasoned about.

 

Computational Tools

Data science uses a multitude of tools, including programming languages, standard statistical tools, and data visualization tools, to facilitate the smooth execution of data operations, open-source projects, and commercial and scientific applications and experiments:

  • SAS
  • Apache Spark
  • BigML
  • D3.js
  • MATLAB
  • Excel
  • ggplot2
  • Tableau
  • Jupyter
  • Matplotlib
  • NLTK
  • Scikit-learn
  • TensorFlow
  • Weka

 

Program Testing Methods

  • Linear and non-linear regression (a minimal sketch follows this list)
  • Neural networks
  • Convolutional neural networks
  • Deep networks
  • Tensorial and independent component analysis
  • Sparsity-inducing linear and non-linear regressions
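
As promised above, here is a minimal sketch of the first method on the list, linear regression, fitted with scikit-learn on invented data; the true coefficients 3 and 2 are ours, chosen so the fit can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data from a noisy linear relationship y = 3x + 2.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # should land close to 3 and 2
print(model.score(X, y))                 # R^2 on the training data
```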