When Data Meets Mining
Dr. Zhang Lei, Senior Consultant at SAS Software Co., Ltd.
The wave of informatization has brought tremendous changes worldwide. When you pay for purchases with a credit card, the transaction and shopping details have already entered the databases of the bank and the mall; when you pick up the phone to make a call, the billing information has already entered the telecom operator's database; when you register at a hospital, the outpatient and prescription records have already entered the hospital's database; when you ride with a public transportation card, the fare information has likewise entered the transit company's database. Information is everywhere, and at every moment a large amount of new information is being generated, much like rush-hour crowds: congested and noisy.
These are just static snapshots. If we let time be the film strip and string the images together, it becomes even more vivid: over the last two or three decades, the data accumulated by enterprises has grown far beyond our imagination. It is like watching a sci-fi disaster movie in which the ever-rising data, like sea levels rising from global warming, gradually encroaches on the land where we live. Familiar buildings, parks, and roads are submerged one by one...
Does this sound alarmist? For companies, it is no myth. Let me give an example of how rapid data growth can bring such trouble and change to a business. Sam Walton was born in 1918 in Kingfisher, Oklahoma. A country boy through and through, he began delivering milk and newspapers at the age of seven, raised rabbits and pigeons for sale, and covered most of his tuition and living expenses with his own part-time work. He earned a bachelor's degree in commerce from the University of Missouri. In 1945, after leaving the military, Sam took over a variety store in Newport, Arkansas: a typical old-style shop, 50 feet wide and 100 feet deep, facing the main street in the town center with a view of the railroad. The store had one cash register, and behind each counter ran an aisle where the clerks waited on customers as they came in. At the start, customers were few, and Sam could remember the regulars by name: what groceries they liked, which brands they usually bought, which products sold best, and what goods to order next month. At this point, his data processing and analysis required nothing more than pen and paper.
After decades of relentless effort, Sam's chain stores spread across the globe, with revenues reaching $351.1 billion in 2006, surpassing the American oil giant ExxonMobil and ranking first on Fortune magazine's list of the world's 500 largest companies. Sam's company is called Walmart (Wal-Mart).
Now the world's largest retailer, Walmart operates on a scale far beyond that of its early days: 7,131 chain stores in 14 countries and nearly 2 million employees serving hundreds of millions of customers. Every day, a massive volume of transaction information streams into the data warehouse at company headquarters, whose capacity exceeds hundreds of terabytes (TB). Discovering marketing opportunities in this ocean of information, finding profitable customers, adjusting product placement, coordinating logistics planning and scheduling: none of this can be done by human experience, reports, or manual analysis alone. People must increasingly rely on the processing power of computers and on advanced analytical techniques to uncover the patterns latent in vast amounts of data.
One of these advanced analytical techniques is data mining, and one of the most typical cases in the field of data mining is the "beer and diapers" story.
### Three Small Stories
#### Story One: Beer and Diapers
Walmart, one of the world's largest retail chains, owns one of the largest data warehouse systems in the world, storing detailed transaction information from each of its stores. To better understand customer purchasing habits, Walmart conducted basket analysis on customer shopping behavior, wanting to know which products customers frequently buy together. As a result, they made an unexpected discovery: "The product most often bought together with diapers is beer!"
This is the result of data mining technology analyzing historical data. Does it conform to real-world situations? Is it useful knowledge? Does it have practical value?
Walmart therefore dispatched market researchers and analysts to investigate this mining result. Their extensive fieldwork uncovered a hidden behavioral pattern behind "diapers and beer": some young American fathers would stop at the supermarket after work to buy diapers for their babies, and 30% to 40% of them would pick up beer for themselves at the same time. The reason: wives often remind their husbands to buy diapers on the way home, and while buying the diapers, the husbands conveniently grab their favorite beer.
Since diapers and beer are often bought together, Walmart decided to place them side by side. The result was a simultaneous increase in the sales of both diapers and beer.
By conventional thinking, diapers and beer seem unrelated. Without the help of data mining technology to analyze a large amount of transaction data, Walmart would not have been able to discover this valuable rule hidden within the data.
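To make the basket analysis behind this story concrete, here is a minimal sketch of how support, confidence, and lift are computed for one candidate rule. It is an illustration only: the toy transactions and the thresholds mentioned in the comments are invented, not Walmart data.

```python
# Minimal market-basket sketch: support, confidence, and lift for the
# candidate rule "diapers -> beer". The transactions are invented toys;
# real basket analysis runs over millions of receipts.

transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "beer", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
]

n = len(transactions)
count_diapers = sum("diapers" in t for t in transactions)
count_beer = sum("beer" in t for t in transactions)
count_both = sum({"diapers", "beer"} <= t for t in transactions)

support = count_both / n                 # P(diapers and beer together)
confidence = count_both / count_diapers  # P(beer | diapers)
lift = confidence / (count_beer / n)     # >1 means diapers raise beer's odds

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# A mining tool would report the rule only if support and confidence
# clear user-chosen thresholds, e.g. support >= 0.3 and confidence >= 0.6.
```

On these six toy baskets, the rule "diapers → beer" has support 0.50 and confidence 0.75: exactly the kind of co-occurrence statistic that would surface such a pattern in a retailer's data.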
#### Story Two: The Roots of Crime
Gloucestershire is a county in western England with a population of about half a million. For a time, a string of robberies left the public feeling unsafe. Public opinion put immense pressure on the Gloucestershire Police Department, strongly demanding that the cases be solved promptly and that further incidents be prevented. While accelerating their investigations, the police were also actively looking for ways to reduce the crime rate.
Traditionally, measures like locking down areas prone to robberies, increasing police patrols, and enhancing checks on suspicious individuals would be taken. However, the Gloucestershire Police Department found that these measures were not very effective, as the robbery locations were not concentrated but scattered across different neighborhoods, making it difficult to comprehensively deploy patrol forces.
At this point, the police department's internal analysis system produced a new discovery. The system contained years of case files and criminal records, and data mining and other analytical techniques revealed that the recent robbers shared some striking characteristics: most were drifters with no fixed residence and no steady job, and many had taken drugs shortly before committing their robberies. Under the influence of drugs they lost self-control and, acting on momentary greed, robbed lone women or couples.
The new discovery gave the police department new ideas. They decisively adjusted their original approach of increasing police presence and strengthening patrols, instead adopting the following measures: First, strengthening management over unemployed individuals and those with drug-related offenses, and providing them assistance through social welfare institutions; second, intensifying crackdowns and management on places where drug transactions frequently occur, cutting off the supply of drugs at the source.
The measures proved highly effective: the incidence of robbery dropped rapidly, and the people of Gloucestershire returned to their peaceful lives.
#### Story Three: Emails Plus News
Yahoo was the first company to appoint a Chief Data Officer, affirming that data truly is a tangible, strategically significant corporate asset. The goal was to encourage user engagement by providing customer-centric data platforms and insight services and by innovating marketing programs, thereby creating value for both consumers and sellers. Dr. Usama Fayyad served as Yahoo's Chief Data Officer. In an interview with Gregory Piatetsky-Shapiro of KDnuggets, he described several of Yahoo's data mining successes.
"Product Integration: An example is what you see today in Yahoo Mail, the visual results of data mining. Through analyzing unexpected patterns in user behavior, we discovered a strong correlation between reading emails and browsing news during each session. We conveyed this finding to the Yahoo Mail product team, who first wanted to verify the impact of this relationship: displaying a news module on the homepage of a test group of users, with the news headlines prominently displayed."
"For products like email, the biggest challenge is how to acquire new 'lightweight users' and increase their usage, turning them into 'heavyweight users.' If you succeed, then the churn rate will significantly decrease. In fact, in our experiment, the most obvious group saw a 40% reduction in churn rate. Thus, Yahoo immediately developed and perfected the news module, embedding it on the Yahoo Mail homepage. Now, hundreds of millions of consumers can see and use this product. I like to mention this story because it well illustrates the quick response ability of our product team, and proves that there are many extremely valuable potential patterns hidden in user behavior data."
"Instant Messaging: We analyzed the usage of Yahoo Messenger to understand the key factors driving usage. It turned out that the most important factor was expanding the users' 'Buddy List,' adding at least five new friends. Based on this, Yahoo carefully designed corresponding marketing activities, encouraging users to add more friends to their Buddy List, significantly boosting Yahoo Messenger's usage."
"The Search Box on Yahoo's Homepage: A simple example is that we found placing the search box in the center of Yahoo's homepage (rather than the previous left side) increases user usage. This not only promotes active user engagement but also costs Yahoo nothing. The discovery process was interesting. We first noticed that Netscape browser users used the search feature more than IE users. Further investigation revealed the only visual difference between the two browsers was the position of the search box! The search box was centered in Netscape but closer to the left in IE. Such a subtle difference was crucial. Who would have thought?"
### What is Data Mining?
Many scholars and experts have defined data mining in different ways. Below we list several common explanations:
"Simply put, data mining is the extraction or 'mining' of knowledge from large amounts of data. The term is actually a bit misnamed. Data mining should more accurately be named 'mining knowledge from data,' unfortunately, it's a bit long. Many people consider data mining synonymous with another common term 'Knowledge Discovery in Database' or KDD. Others see data mining as just a fundamental step in the process of knowledge discovery in databases." ―― *Data Mining: Concepts and Techniques* (J. Han and M. Kamber)
"Data mining is the analysis of observed datasets (often very large), aiming to discover unknown relationships and summarize the data in fresh ways that are understandable and valuable to the data owner." ―― *Principles of Data Mining* (David Hand et al.)
"The entire process of obtaining useful knowledge from data using computer-based methods, including new technologies, is called data mining." ―― *Data Mining - Concepts, Models, Methods, and Algorithms* (Mehmed Kantardzic)
"Data mining, simply put, is automatically discovering relevant patterns from a database." ―― *Building Data Mining Applications for CRM* (Alex Berson et al.)
"Data mining (DM) is the process of extracting hidden predictive information from large databases." ―― *Data Mining: Opportunities and Challenges* (John Wang)
Professor Jiawei Han, the foremost Chinese expert in the field of data mining, gives a crisper definition in the lecture slides accompanying *Data Mining: Concepts and Techniques*: "Data mining is the process of extracting interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from large databases."
Here, we can see that data mining has the following characteristics:
- **Based on large amounts of data**: This is not to say that small data sets cannot be mined; most data mining algorithms run on small data sets and produce results. But small data sets can, on the one hand, be analyzed perfectly well by hand to summarize their patterns, and on the other hand often fail to reflect the universal characteristics of the real world.
- **Non-triviality**: The knowledge extracted should not be simple or obvious, like the remark of a famous sports commentator: "After my calculations, I have found an interesting phenomenon: as of the end of this match, the goals scored and the goals conceded in this World Cup are equal. What a coincidence!" Such "knowledge" is unacceptable. This may seem to go without saying, yet inexperienced data miners who lack domain knowledge often make exactly this mistake.
- **Implicitness**: Data mining aims to uncover hidden knowledge within data, rather than surface-level information directly visible in the data. Common BI tools like reports and OLAP can easily allow users to find such surface-level information.
- **Novelty**: The knowledge extracted should be previously unknown; otherwise, it merely verifies the experience of business experts. Only entirely new knowledge can help businesses gain deeper insights.
- **Value**: The results must provide direct or indirect benefits to the enterprise. Some say that data mining is just a "dragon-slaying skill," seemingly magical but ultimately useless. This is a misunderstanding. It is undeniable that in some data mining projects, poor results may arise due to unclear business objectives, insufficient data quality, resistance to changing business processes, or insufficient experience of the data miners. However, numerous successful cases demonstrate that data mining can indeed become a powerful tool to enhance efficiency.
It is difficult to trace when the term "data mining" became widely accepted, but it probably rose to prominence in the 1990s. There is an interesting anecdote. Academia initially used the term "Knowledge Discovery in Databases" (KDD). At the first KDD international conference, the committee debated whether to keep KDD or switch to "Data Mining," and finally put it to a vote. The result was dramatic: of the 14 committee members, 7 voted for KDD and 7 for Data Mining. Then a senior member argued that "the term 'data mining' is too vague; research should be about knowledge," and so academia kept the term KDD. In the commercial world, "Knowledge Discovery in Databases" seemed too long-winded, and the simpler, more accessible "Data Mining" was widely adopted.
Strictly speaking, data mining is not a completely new field; it has something of an "old wine in a new bottle" character. Its three pillars are research results from statistics, machine learning, and databases, supplemented by material from visualization and information science. Data mining draws regression analysis, discriminant analysis, cluster analysis, and confidence intervals from statistics; decision trees and neural networks from machine learning; and association analysis and sequence analysis from the database field.
### What Can Data Mining Do?
Data mining has many uses. Here, I want to briefly discuss it from both technical and application perspectives.
From a technical standpoint, data mining can be roughly divided into two major categories: descriptive mining and predictive mining. Descriptive mining distills and summarizes existing data, extracting higher-level concepts that reflect its characteristics. For instance, a bank has millions of customers, and its data warehouse stores detailed data on each of them: demographics, accounts, transactions, customer-service contacts, and so on. Yet the bank cannot readily tell what type of customer each individual is or what their consumption patterns are. It then typically segments all customers into several groups, so that customers with similar behaviors and value land in the same group. With these segments in hand, the bank can more easily spot marketing opportunities and formulate marketing strategies. The mining technique used here is the clustering model, a typical descriptive method.
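As a minimal sketch of what such segmentation can look like in code, here is a k-means clustering example using scikit-learn. Everything in it is an assumption for illustration: the three features, their synthetic distributions, and the choice of four segments.

```python
# Customer-segmentation sketch with k-means. Features and the number of
# segments are illustrative assumptions, not a real bank's design.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical customer features: [age, monthly_spend, num_products]
customers = rng.normal(loc=[40, 3000, 2], scale=[12, 1500, 1], size=(1000, 3))

# Standardize so each feature contributes comparably to the distances.
X = StandardScaler().fit_transform(customers)

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Every customer now carries a segment label the bank can act on.
for seg in range(4):
    members = customers[model.labels_ == seg]
    print(f"segment {seg}: {len(members)} customers, "
          f"avg monthly spend {members[:, 1].mean():.0f}")
```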
Predictive mining, as the name suggests, involves building a mining model with predictive capabilities. These predictive capabilities may include predicting which customers will churn next month, which customers will respond positively to promotional activities, which customers' future values will grow, and by how much, etc. Predictive mining often has stronger guidance for business operations, thus taking effect faster.
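A minimal predictive-mining sketch, in the same spirit: a decision tree trained to flag likely churners. The features, the synthetic churn rule, and all parameters are invented for illustration, not any real operator's model.

```python
# Churn-prediction sketch with a decision tree. The data is synthetic:
# short tenure and many support calls are wired to raise churn odds.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
# Hypothetical features: [months_active, calls_to_support, monthly_spend]
X = rng.normal(loc=[24, 2, 50], scale=[12, 2, 20], size=(2000, 3))
churn_prob = 1 / (1 + np.exp(0.08 * X[:, 0] - 0.9 * X[:, 1]))
y = (rng.random(2000) < churn_prob).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))
# Scoring fresh customers with tree.predict_proba tells the marketer whom
# to target with retention offers before they actually leave.
```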
From an application perspective, data mining can be applied in many industries, including telecommunications, banking, securities, insurance, manufacturing, and the Internet. Setting industry-specific applications aside, data mining is most commonly applied, across industries, to Customer Relationship Management (CRM): customer segmentation, customer value analysis, customer acquisition, customer retention, and cross-selling and up-selling. In addition, credit scoring, fraud detection, and text mining are also common applications.
Customer segmentation has already been discussed in the example of descriptive mining, so I won't elaborate further.
Accurately evaluating customer value is key to running a company well. Customer value here includes not only the revenue a customer currently brings in but also the various costs incurred on that customer and the value the customer will bring in the future. Combining current value and future value yields a comprehensive evaluation of the customer across the whole lifecycle (from becoming a customer to eventual churn), known as lifetime value (LTV). Once customer value is clearly understood, customers can be treated differentially: striving to retain high-value customers, growing medium-value customers into high-value ones, and offering different levels of service to customers of different value.
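One common simplified way to turn this idea into a number is to discount each future period's expected margin by both the probability that the customer is still active and the time value of money. The sketch below uses that textbook-style formula; the margin, retention rate, discount rate, and horizon are all invented inputs.

```python
# Simplified lifetime-value (LTV) sketch: each year's margin is weighted
# by the chance the customer is still active, then discounted to today.
# All four inputs are illustrative assumptions.

def lifetime_value(annual_margin: float, retention: float,
                   discount: float, years: int) -> float:
    """LTV = sum over t of margin * retention**t / (1 + discount)**t."""
    return sum(
        annual_margin * retention ** t / (1 + discount) ** t
        for t in range(1, years + 1)
    )

# A customer worth $400/year with 80% retention, a 10% discount rate,
# and a 10-year horizon is worth roughly $1,000 today, not $4,000.
print(f"LTV = ${lifetime_value(400, 0.80, 0.10, 10):,.0f}")
```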
The figure below shows data mining applications at various stages of the customer lifecycle. The horizontal axis represents time, and the vertical axis represents the profit the customer brings to the company at each point in time. The lifecycle divides into four phases: initiation (from potential customer to new customer), development (gradually expanding the range and quantity of products used), maturity (contributing maximum profit to the company), and termination (gradually drifting away and churning).
What can data mining do for us at each stage of the customer lifecycle? In the initiation phase, potential customers have not yet interacted much with the company; they lack familiarity with its brands and products and are still looking around for something that suits them, so their loyalty is generally low. Here, data mining can help lock onto target groups of potential customers: analyzing existing customers and past marketing activities to discover who is most likely to become a customer, which promotional means and channels will attract them most effectively, and how much benefit winning them over will bring the company. This type of application is called "customer acquisition."
In the development phase, customers use the company's products and services relatively infrequently and in small quantities. At this stage, data mining techniques can be used to activate dormant customers, stimulate users to purchase more different products (cross-selling), or expand the purchase volume of existing products (upselling). Association analysis in data mining can help companies discover which products are most closely related, prediction techniques can help us understand whether customers will respond positively to specific marketing activities, and clustering techniques can help us find customer groups with similar behaviors and preferences, thereby further promoting customers to develop into high-value customers.
In the maturity phase, the customer's contribution to the company's profits has already reached its peak. However, at this stage, the company cannot rest on its laurels but should think ahead, strictly preventing the loss of high-quality customers, and promptly responding to fierce market competition. At this stage, predictive techniques in data mining can be used to discover early which customers have shown abnormal behavior and may churn, and take targeted retention actions.
In fact, throughout the entire customer lifecycle, we must continuously analyze customer behavior and value, always keeping track of their preferences and changes. Only in this way can we strengthen the company's insight into customers and guide and promote operations effectively. And all these analyses are things that data mining can help us achieve.
### Data Mining Process and Mainstream Tools
Due to space limitations, this article does not intend to elaborate on the technical aspects of data mining. Readers can refer to classic textbooks to gain relevant knowledge, such as *Data Mining: Concepts and Techniques*, *Principles of Data Mining*, *Machine Learning*, etc. Generally speaking, common data mining techniques include: clustering algorithms for customer segmentation, association analysis and sequence analysis algorithms for cross-selling, predictive algorithms such as decision trees, neural networks, and regression for customer value analysis, churn analysis, and cross-selling, text mining and Web analysis for the Internet, etc.
Eric King, in his article "How to Invest in Data Mining: A Framework to Avoid Expensive Pitfalls in Predictive Analytics" (published in the October 2005 issue of *DM Review*), argues that data mining is a journey, not a destination. He defines this journey as the data mining process. This process includes the following elements:
- A discovery process
- A flexible framework
- Proceeding according to a clearly defined strategy
- Multiple review points
- Periodic evaluations
- Feedback loops that allow functions to be adjusted
- Organized as an iterative architecture
Many data mining tool vendors have streamlined this process to make it clearer. SAS divides the data mining process into five stages, known by the acronym SEMMA: Sample, Explore, Modify, Model, and Assess. People used to liken the data mining process to a tiered, circulating fountain. Water (data) first pours into the top basin (an analysis stage), forming eddies (refinement and feedback). Once enough "processed" water accumulates, it overflows into the basin below, and the "processing" continues until the water reaches the lowest level, where it is pumped back to the top to begin a new round. Data mining closely resembles this layered, iterative process. Even inside many data mining algorithms the processing is similar: a neural network, for example, makes many passes (epochs) over the dataset until a satisfactory solution is found.
However, the fountain is not an entirely apt metaphor, because it fails to capture the feedback loops that are so common in data mining. For example, data assessment may reveal anomalies that require extracting more data from the source systems; or, after modeling, it may turn out that more records are needed to reflect the overall distribution.
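To make this iterate-with-feedback shape concrete, here is a schematic sketch of a SEMMA-style loop. Every stage function is a placeholder standing in for real work, not an actual SAS interface.

```python
# Schematic SEMMA-style loop with the feedback paths the fountain
# metaphor misses. All stage functions are placeholders.

def sample(source):  return source[::10]          # draw a manageable subset
def explore(data):   return {"anomalies": False}  # profile, visualize, check
def modify(data):    return data                  # clean, transform, derive
def model(data):     return {"fit": 0.9}          # train candidate models
def assess(m):       return m["fit"] > 0.8        # judge against the goal

def mine(source, max_iterations=5):
    for _ in range(max_iterations):
        data = modify(sample(source))
        if explore(data)["anomalies"]:
            continue      # feedback: pull more or cleaner data and retry
        m = model(data)
        if assess(m):
            return m      # good enough: hand the model to the business
    return None           # loop exhausted: revisit the business objective

print(mine(list(range(1000))))
```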
"To excel in one's work, one must first sharpen one's tools." When enterprises plan to use data mining to improve operations, choosing the right data mining tools becomes very important. Tool selection is typically considered from the following angles (while also needing to combine with the enterprise's level of informatization, specific business objectives, data volume to be processed, changes in business processes, etc.):
- Data access capability: Can it access various types of data, and how efficient are the data interfaces?
- Data preparation capability: Data processing capabilities, including sampling, filtering, transformation, integration, exploration, etc.
- Breadth and depth of model algorithms: Does it support a wide range of mining algorithms and the comparison and ranking of multiple models?
- Visualization capability: Various graphic displays, interactive operations
- Performance: Support for hardware and software platforms, parallelism, multi-CPU, multi-threading, distributed architecture
- Support for various users and industry solutions
- Other capability supports: Chinese support, user-friendly interface, batch processing, APIs, metadata management, etc.
Enterprises can also refer to third-party evaluation agencies' assessment results to choose data mining tools. Relatively authoritative evaluation agencies include Gartner, IDC, etc. Below, we quote part of the content from Gartner's "Customer Data Mining Magic Quadrant" evaluation report published in the second quarter of 2007 to briefly introduce mainstream data mining products.
"Recently, the well-known software evaluator Gartner evaluated the software in the data mining field. The final result is that SAS and SPSS, as well as traditional positions in the field, remain in the leading quadrant of data mining. Rising stars are KXEN and Portrait Software, emerging as visionary companies. The challenger quadrant is blank, and other more than ten vendors occupy niche markets."
"In this evaluation, nine companies were selected: SAS, SPSS, KXEN, Portrait Software, Angoss Software, Unica, ThinkAnalytics, Fair Isaac, Infor CRM Epiphany. This represents the current market situation. In the Chinese market, the main data mining tools are SAS, KXEN, and SPSS."
The first-quarter 2006 report had also included the vendors Chordiant and Teradata.
The evaluation results are shown in the figure below. The criteria fall along two dimensions: ability to execute (vertical axis) and completeness of vision (horizontal axis). Ability to execute comprises seven standards: product/service, market responsiveness and track record, overall viability, customer experience, marketing execution, sales execution/pricing, and operations. Completeness of vision comprises eight standards: product strategy, market understanding, marketing strategy, sales strategy, vertical/industry strategy, business model, innovation, and geographic strategy.
Figure: Gartner Customer Data Mining Magic Quadrant (Second Quarter 2007)
In the figure above, mainstream data mining vendors fall into four quadrants: Leaders, Challengers, Visionaries, and Niche Players. Below we briefly introduce the two leading companies in the field, SAS and SPSS.
**SAS**
In the data mining market, SAS is the largest vendor, with the most analysts and the most customer experience; its products are the traditional standard tools of data mining, and outsourcing and service providers are deeply familiar with them.
SAS has the most complete data