Mine Module

Overview

This step involves math, statistics, and data mining.
Data mining is the process of detecting patterns in data to gain insight.

There are many data mining techniques that can be used to turn raw data into actionable insights. These range from the basics of data preparation to cutting-edge artificial intelligence (AI), both of which are key to maximizing the value of data.

[Source]

Key Concepts

Basic Descriptors
  • Basic statistical methods that can be used to describe data (mean, max, min, etc.); see the sketch after this list

Categorize
  • Can the data be grouped together?
  • What’s similar?
  • What’s different?

Variables
  • Qualitative (categorical): nominal, ordinal
  • Quantitative: discrete, continuous
  • Levels of measurement: nominal, ordinal, interval, ratio

Temporal
  • Is the data streaming?
  • How is it collected and stored (all at once or over time; at yearly, daily, minute, or second granularity)?
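
As a quick illustration of the basic descriptors mentioned above, the following sketch computes simple summary statistics with Python's standard library; the daily sales figures are entirely hypothetical.

```python
import statistics

# Hypothetical daily sales figures, used only for illustration.
daily_sales = [120, 135, 128, 150, 142, 138, 160]

print("mean:", statistics.mean(daily_sales))    # average value
print("median:", statistics.median(daily_sales))
print("min:", min(daily_sales))                 # smallest observation
print("max:", max(daily_sales))                 # largest observation
print("stdev:", statistics.stdev(daily_sales))  # spread around the mean
```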

Resources


Data Mining Techniques and Descriptions
Data cleaning and preparation
Data cleaning and preparation is a vital part of the data mining process. Raw data must be cleansed and formatted to be useful in different analytic methods. Data cleaning and preparation includes different elements of data modeling, transformation, data migration, data integration, and aggregation. It’s a necessary step for understanding the basic features and attributes of data to determine its best use.

Without this first step, data is either meaningless to an organization or unreliable due to its quality. Companies must be able to trust their data, the results of their analytics, and the actions taken based on those results.

These steps are also necessary for data quality and proper data governance.
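
As a minimal sketch of cleaning and preparation, assuming pandas is available, the snippet below fixes types, removes duplicates, and fills missing values in a small, made-up table; the column names and fill rule are illustrative assumptions, not a prescribed workflow.

```python
import pandas as pd

# Hypothetical raw records with a duplicate row, mixed types, and a missing value.
raw = pd.DataFrame({
    "customer_id": ["001", "002", "002", "003"],
    "signup_date": ["2021-01-05", "2021-02-11", "2021-02-11", "not available"],
    "monthly_spend": ["120.5", "87", "87", None],
})

clean = (
    raw.drop_duplicates()  # remove repeated rows
       .assign(
           signup_date=lambda d: pd.to_datetime(d["signup_date"], errors="coerce"),
           monthly_spend=lambda d: pd.to_numeric(d["monthly_spend"], errors="coerce"),
       )
)
# Fill missing spend with the column median so later aggregation is not skewed by NaNs.
clean["monthly_spend"] = clean["monthly_spend"].fillna(clean["monthly_spend"].median())
print(clean)
```
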
Tracking Patterns
Tracking patterns is a fundamental data mining technique. It involves identifying and monitoring trends or patterns in data to make intelligent inferences about business outcomes.

For example, once an organization identifies a trend in sales data, there’s a basis for taking action to capitalize on that insight. If it’s determined that a certain product is selling more than others for a particular demographic, an organization can use this knowledge to create similar products or services, or simply better stock the original product for that demographic.
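
A minimal pattern-tracking sketch under assumed data: monthly sales for a hypothetical product are aggregated by demographic segment (pandas is assumed), and a simple rolling average is used to make the trend visible.

```python
import pandas as pd

# Hypothetical monthly sales records, for illustration only.
sales = pd.DataFrame({
    "month": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01"] * 2),
    "segment": ["18-25"] * 3 + ["26-40"] * 3,
    "units": [100, 130, 170, 90, 85, 80],
})

# Total units per segment shows which demographic is buying more overall.
print(sales.groupby("segment")["units"].sum())

# A rolling mean per segment smooths noise and surfaces the direction of the trend.
sales["trend"] = sales.groupby("segment")["units"].transform(lambda s: s.rolling(2).mean())
print(sales)
```
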
Classification
Classification data mining techniques involve analyzing the various attributes associated with different types of data. Once organizations identify the main characteristics of these data types, they can categorize or classify related data. Doing so is critical for identifying, for example, personally identifiable information that organizations may want to protect or redact from documents.
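
A minimal classification sketch, assuming scikit-learn is available; the tiny synthetic dataset, its features, and its labels are hypothetical, and logistic regression is just one of many possible classifiers.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical records: [document_length, count_of_digit_tokens]; label 1 = contains PII.
X = [[120, 9], [80, 7], [300, 1], [250, 0], [90, 8], [400, 2], [60, 6], [350, 1]]
y = [1, 1, 0, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)  # learn attributes that separate the classes
print("predicted labels:", clf.predict(X_test))   # classify unseen records
print("accuracy:", clf.score(X_test, y_test))
```
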
Association
Association is a data mining technique related to statistics. It indicates that certain data (or events found in data) are linked to other data or data-driven events. It is similar to the notion of co-occurrence in machine learning, in which the likelihood of one data-driven event is indicated by the presence of another.

The statistical concept of correlation is also similar to the notion of association. This means that the analysis of data shows a relationship between two data events, such as the fact that the purchase of hamburgers is frequently accompanied by a purchase of French fries.
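
A minimal association sketch in plain Python: for a handful of hypothetical market-basket transactions, it computes the support of the hamburger and fries pair and the confidence that fries accompany a hamburger purchase.

```python
# Hypothetical transactions, for illustration only.
transactions = [
    {"hamburger", "fries", "soda"},
    {"hamburger", "fries"},
    {"salad", "soda"},
    {"hamburger", "soda"},
    {"hamburger", "fries", "salad"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"hamburger", "fries"} <= t)
hamburger = sum(1 for t in transactions if "hamburger" in t)

support = both / n             # how often the pair occurs across all transactions
confidence = both / hamburger  # how often fries accompany a hamburger purchase

print(f"support(hamburger, fries) = {support:.2f}")
print(f"confidence(hamburger -> fries) = {confidence:.2f}")
```
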
Outlier Detection
Outlier detection identifies anomalies in datasets. Once organizations find aberrations in their data, it becomes easier to understand why these anomalies happen and to prepare for future occurrences in order to best achieve business objectives.

For instance, if there’s a spike in the usage of credit-card transactional systems at a certain time of day, organizations can capitalize on this information by figuring out why it’s happening and using that insight to optimize sales during the rest of the day.
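
A minimal outlier-detection sketch on hypothetical hourly transaction counts, flagging hours whose volume sits far from the mean; the two-standard-deviation threshold is an illustrative choice, not a rule.

```python
import statistics

# Hypothetical credit-card transactions per hour, for illustration only.
hourly_counts = [52, 48, 50, 47, 55, 49, 51, 180, 53, 50, 46, 54]

mean = statistics.mean(hourly_counts)
stdev = statistics.stdev(hourly_counts)

# Flag hours whose volume deviates strongly from the typical load (threshold is arbitrary here).
outliers = [(hour, count) for hour, count in enumerate(hourly_counts)
            if abs(count - mean) > 2 * stdev]
print("mean:", round(mean, 1), "stdev:", round(stdev, 1))
print("anomalous hours:", outliers)
```
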
Clustering
Clustering is an analytics technique that relies on visual approaches to understanding data. Clustering mechanisms use graphics to show how data is distributed in relation to different metrics, and they often use different colors to distinguish the resulting groups.

Graphical approaches are ideal for cluster analytics. With graphs and clustering in particular, users can see how data is distributed and identify trends that are relevant to their business objectives.
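
A minimal clustering sketch, assuming scikit-learn and matplotlib are available; the two-dimensional points are synthetic, and choosing k=2 is an assumption for this toy example rather than a recommendation.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic 2-D points forming two loose groups, used only for illustration.
points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [1.1, 1.3],
          [5.0, 5.2], [5.3, 4.9], [4.8, 5.1], [5.1, 5.4]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Color each point by its assigned cluster so the grouping is visible at a glance.
xs = [p[0] for p in points]
ys = [p[1] for p in points]
plt.scatter(xs, ys, c=kmeans.labels_)
plt.title("K-means clusters (synthetic data)")
plt.show()
```
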
Regression
Regression techniques are useful for identifying the nature of the relationship between variables in a dataset. Those relationships could be causal in some instances and merely correlational in others. Regression is a straightforward white box technique that clearly reveals how variables are related, and it is used in forecasting and data modeling.
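
A minimal regression sketch, assuming scikit-learn; the advertising-spend and revenue figures are made up, and a simple linear model is used only to show how the relationship between two variables can be read off directly.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical observations: advertising spend (in thousands) vs. revenue (in thousands).
spend = [[1.0], [2.0], [3.0], [4.0], [5.0]]
revenue = [2.1, 3.9, 6.2, 8.1, 9.8]

model = LinearRegression().fit(spend, revenue)

# The coefficient states how revenue changes per unit of spend (a white box result).
print("slope:", round(model.coef_[0], 2))
print("intercept:", round(model.intercept_, 2))
print("forecast at spend=6:", round(model.predict([[6.0]])[0], 2))
```
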
Prediction
Prediction is a very powerful aspect of data mining and represents one of the four branches of analytics. Predictive analytics uses patterns found in current or historical data and extends them into the future, giving organizations insight into which trends will appear next in their data. There are several different approaches to predictive analytics; some of the more advanced involve aspects of machine learning and artificial intelligence.

However, predictive analytics doesn’t necessarily depend on these techniques; it can also be facilitated with more straightforward algorithms.
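
A minimal forecasting sketch using only NumPy: a straight-line trend is fitted to a short, hypothetical monthly series and extended one step into the future. This illustrates the kind of straightforward algorithm mentioned above, not an advanced machine learning model.

```python
import numpy as np

# Hypothetical monthly order counts, for illustration only.
months = np.arange(6)  # months 0..5
orders = np.array([200, 215, 230, 240, 260, 270])

slope, intercept = np.polyfit(months, orders, 1)  # fit a linear trend to the history
next_month = 6
forecast = slope * next_month + intercept          # extend the trend into the future
print(f"estimated orders for month {next_month}: {forecast:.0f}")
```
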
Sequential Patterns
This data mining technique focuses on uncovering a series of events that takes place in sequence. It’s particularly useful for mining transactional data. For instance, this technique can reveal which items of clothing customers are more likely to buy after an initial purchase of, say, a pair of shoes. Understanding sequential patterns can help organizations recommend additional items to customers to spur sales.
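
A minimal sequential-pattern sketch in plain Python: for hypothetical, time-ordered purchase histories, it counts what customers buy immediately after shoes.

```python
from collections import Counter

# Hypothetical purchase histories, each ordered by time of purchase.
histories = [
    ["shoes", "socks", "jacket"],
    ["shoes", "socks"],
    ["shirt", "shoes", "belt"],
    ["shoes", "socks", "belt"],
]

# Count the item bought immediately after a pair of shoes in each sequence.
followers = Counter()
for history in histories:
    for first, second in zip(history, history[1:]):
        if first == "shoes":
            followers[second] += 1

print(followers.most_common())  # here, socks follow shoes most often
```
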
Decision Trees
Decision trees are a specific type of predictive model that lets organizations effectively mine data. Technically, a decision tree is a machine learning model, but it is popularly known as a white box technique because of its extremely straightforward nature.

A decision tree enables users to clearly understand how the data inputs affect the outputs. When various decision tree models are combined, they create predictive analytics models known as random forests. Complicated random forest models are considered black box machine learning techniques, because it’s not always easy to understand their outputs based on their inputs. In most cases, however, this basic form of ensemble modeling is more accurate than using decision trees on their own.
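
A minimal sketch contrasting a single decision tree with a random forest, assuming scikit-learn; the tiny synthetic dataset and its labels are for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier

# Hypothetical records: [age, monthly_spend]; label 1 = customer renewed a subscription.
X = [[22, 40], [25, 35], [47, 80], [52, 95], [30, 20], [60, 120], [28, 25], [55, 100]]
y = [0, 0, 1, 1, 0, 1, 0, 1]

# White box: the learned rules of a single tree can be printed and read directly.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "monthly_spend"]))

# An ensemble of many trees is usually more accurate but harder to interpret.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print("forest prediction for [40, 70]:", forest.predict([[40, 70]]))
```
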
Statistical Techniques
Statistical techniques are at the core of most analytics involved in the data mining process. The different analytics models are based on statistical concepts, which output numerical values that are applicable to specific business objectives. For instance, neural networks in image recognition systems use complex statistics based on different weights and measures to determine whether a picture shows a dog or a cat.

Statistical models represent one of the two main branches of artificial intelligence. The models behind some statistical techniques are static, while others that involve machine learning improve over time.

Visualization
Data visualizations are another important element of data mining. They give users visual insight into data. Today’s data visualizations are dynamic, useful for streaming data in real time, and characterized by different colors that reveal different trends and patterns in data.

Dashboards are a powerful way to use data visualizations to uncover data mining insights. Organizations can base dashboards on different metrics and use visualizations to highlight patterns in data, instead of relying solely on the numerical outputs of statistical models.
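
A minimal visualization sketch with matplotlib: the same hypothetical per-segment totals that a statistical model would print as numbers are drawn as a bar chart so the pattern stands out at a glance.

```python
import matplotlib.pyplot as plt

# Hypothetical totals per customer segment, for illustration only.
segments = ["18-25", "26-40", "41-60", "60+"]
totals = [400, 255, 310, 180]

plt.bar(segments, totals, color=["tab:blue", "tab:orange", "tab:green", "tab:red"])
plt.xlabel("Customer segment")
plt.ylabel("Units sold")
plt.title("Sales by segment (synthetic data)")
plt.show()
```
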
Neural Networks
Neural networks are a specific type of machine learning model often used with AI and deep learning. So named because their layered structure resembles the way neurons work in the human brain, neural networks are among the more accurate machine learning models used today.

Although a neural network can be a powerful tool in data mining, organizations should exercise caution when using one: some neural network models are incredibly complex, which makes it difficult to understand how the network arrived at an output.
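
A minimal neural-network sketch using scikit-learn’s multi-layer perceptron on a tiny synthetic dataset; real image-recognition networks are far larger, and this is only meant to show the layered-model idea in code.

```python
from sklearn.neural_network import MLPClassifier

# Tiny synthetic dataset: two numeric features and a binary label, for illustration only.
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2], [0.15, 0.85], [0.85, 0.15]]
y = [0, 0, 1, 1, 0, 1]

# Two hidden layers loosely mirror the "layers of neurons" described above.
net = MLPClassifier(hidden_layer_sizes=(8, 4), solver="lbfgs",
                    max_iter=2000, random_state=0).fit(X, y)
print("prediction for [0.3, 0.7]:", net.predict([[0.3, 0.7]]))
```
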
Data Warehousing
Data warehousing is an important part of the data mining process. Traditionally, data warehousing involved storing structured data in relational database management systems so it could be analyzed for business intelligence, reporting, and basic dashboarding. Today, there are cloud data warehouses, as well as warehouses built on semi-structured and unstructured data stores like Hadoop. While data warehouses were traditionally limited to historic data, many modern approaches can provide in-depth, real-time analysis of data.

Long-term memory processing
Long-term memory processing refers to the ability to analyze data over extended periods of time. The historic data stored in data warehouses is useful for this purpose. When an organization can perform analytics over an extended period of time, it’s able to identify patterns that might otherwise be too subtle to detect. For example, by analyzing attrition over several years, a financial services organization may find subtle clues that could help reduce churn.

Machine Learning and Artificial Intelligence
Machine learning and artificial intelligence (AI) represent some of the most advanced developments in data mining. Advanced forms of machine learning like deep learning offer highly accurate predictions when working with data at scale. Consequently, they’re useful for processing data in AI deployments like computer vision, speech recognition, or sophisticated text analytics using natural language processing (NLP). These data mining techniques are good at extracting value from semi-structured and unstructured data.

Table content adapted from [Source]


Practice Quiz

Instructions
Choose an answer and hit 'Next Question'. You will receive your score and answers at the end of the quiz.
Download Quiz
Click on "Download" to save a copy of the practice quiz.

Worksheet

Practice the concepts from this module by completing the worksheet and review what you have learned.



Self Assessment

Complete this assessment to demonstrate your current knowledge of the Mine stage:

Prerequisites: Make sure to finish the following tasks before working on this assessment.



Review

Horizontal assessment of the Mine stage across the data visualization process, mapped to Bloom’s Taxonomy of Hierarchical Learning

What you should know:

Bloom’s Taxonomy Hierarchy: You should know
Remember: This step involves math, statistics, and data mining.
Understand: Basic statistical techniques (covered in class) for mining data.
Apply: Data mining techniques suitable for the data at hand and how to apply them.
Analyze: Assess the mined data for patterns.
Evaluate: The usefulness of the mining technique used in uncovering patterns in the data.
Create: Plan, generate, and produce questions to be answered by the mined data.


Glossary of terms

Data governance is a collection of processes, roles, policies, standards, and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals.

Data integration is the process of combining data from different sources into a single, unified view. Integration begins with the ingestion process, and includes steps such as cleansing, mapping, and transformation. Data integration ultimately enables analytics tools to produce effective, actionable intelligence.

Categorical (qualitative) variables: nominal, ordinal
Quantitative variables: discrete, continuous (measured on interval or ratio scales)

What you should be able to do:

Bloom’s Taxonomy Hierarchy: You should be able to do
Remember: Describe what happens in the mine stage.
Understand: Describe the types of techniques to be used to better understand the data.
Apply: Execute techniques and methods (statistical methods) on the data.
Analyze: Identify patterns and both extreme and subtle features of the data.
Evaluate: Examine the resulting data and determine whether it enables you to answer the question being addressed.
Create: Determine whether the data can support the question to be answered.


You are here:
  • Acquire
  • Parse
  • Mine
  • Sketching & Ideation
  • Filter
  • Represent
  • Critique
  • Refine
  • Interact