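7/14/2007 Data Mining Tools: Which One is Best for CRM? (Part 1)
Compiled and translated by: Sunstone Zhang (http://sunstonezhang.spaces.live.com)
http://www.dmreview.com/article_sub.cfm?articleId=1046025
Published in DM Direct Special Report, January 24, 2006 Issue

Several years have passed since I last ventured to answer the question "How to choose a data mining tool." This article is organized around two central ideas:
1. There is no best tool; more precisely, there is no single tool that is best for everyone.
2. The most useful tools are those that cover the great majority of the data mining tasks you need to perform.

Major Data Mining Tasks
In the past, data mining tool development focused mainly on providing powerful analytical algorithms. But the analytical "engine" handles only a small part of the work in a data mining project. Most data miners know that 70 to 90 percent of a data mining project is spent on data preparation. Throughout the evolution of data mining tools, data preparation features have been given second place. Finally, you must be able to evaluate models accurately in order to compare several models and recommend them to marketing staff.

Data Preparation Tasks
Common data preparation tasks include: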
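Most data mining tools treat these data preparation functions as secondary; this article focuses on evaluating how well common data mining tools handle them. Beyond supporting the data preparation tasks above, a good data mining tool should also include model evaluation features, so that the models produced during modeling can be compared and used to support direct marketing.

Model Evaluation Tools
In analytical theory, the best model is the one with the highest accuracy: it predicts the class of the target variable correctly and also performs stably on the validation data set. That means we must consider the combined accuracy of predicting responders and non-responders. This approach is called the Global Accuracy method. Most data mining tools use it to determine the "best" model. It has a flaw, however. The Global Accuracy method rests on the assumption that all types of classification error cost the same. That works well in the classroom but can cause problems in real CRM data mining applications, especially those that drive direct mail campaigns. In fact, this is one of the main reasons why many CRM efforts to support direct mail have failed to produce clear business value in the past. Model evaluation follows several principles, but only some of them matter to the marketing department: maximizing the response rate of target customers and minimizing the cost of doing so. Most data mining tools concentrate on the combined accuracy of prediction and ignore cost entirely.

In a direct marketing campaign, the cost of mailing to a prospect who does not respond (a "false-positive" error) is quite low. Failing to mail to a prospect who would have responded (a "false-negative" error) is costly, because without winning that prospect as a customer you lose the membership fees he would have paid and any other services he would have bought. Evaluation of direct marketing models should therefore minimize false-negative errors rather than false-positive errors. Because the marketing department cares only about response rate and cost, a list whose top 30 percent contains 60 percent of all responders will meet its needs. Some of the customers in that top 30 percent will still not respond (false-positive errors), but mailing to them is worthwhile because we have reached 60 percent of all responders. That is twice as effective as mailing at random, and so more cost-effective.

Most data mining tools use the Global Accuracy method for model evaluation. They may require you to use it, through the tool's reporting functions, to identify the "best" model. When different algorithms produce several models, we should not pick the best one simply by comparing the accuracy reports the tools provide. A more appropriate evaluation sorts the model results by predicted probability to produce a scored list, and then checks whether the true responders fall in the top segments. Although classification algorithms can output class probabilities, the actual class (e.g., 0 or 1) is a further summarization of that probability (e.g., <0.5 = 0; ≥ 0.5 = 1). Much of the real "gold" is hidden in the tool's feature set. A novice CRM data miner will focus on classification and accuracy, but the real "gold" lies in the probabilities of retention, purchase and new customer acquisition.

A cumulative lift table (e.g., Table 1) should be examined to judge whether the model really places the true-positives in the top groups. A cumulative lift table can be created as follows: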
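Table 1: Lift Table

Translator's note:
Decile: group number
Hits: number of actual responders in the group, equal to TP + FN
TP: true positives; FN: false negatives; TN: true negatives; FP: false positives (TP and FN are actual responders; TN and FP are actual non-responders)
Random Hits: the random expectation per group, equal to SUM(TP + FN) / 10
% of Total: recall, equal to Hits / SUM(Hits) * 100
Cum % of Total: cumulative recall, the running total of % of Total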
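There are 10 groups and SUM(Hits) = 275 actual responders in total, so the random expectation for each group is 275 / 10 = 27.5. The first group has 81 hits, well above random expectation, for a recall of 81 / 275 = 29.45%. The second group has 43 hits, also above random expectation, for a recall of 43 / 275 = 15.64%. Its cumulative recall is its own recall plus that of all preceding groups (here, the first group): 15.64% + 29.45% = 45.09%.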
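The translator's arithmetic above can be checked with a minimal Python sketch. The function name `lift_rows` and its parameters are illustrative assumptions, not part of the article; only the figures 81, 43, 275 and the 10 groups come from the text, and the remaining rows of Table 1 are not reproduced here.

```python
def lift_rows(hits_per_group, total_hits, n_groups=10):
    """Compute the random expectation plus (hits, % of total, cum % of total)
    for the groups supplied, in order. Hypothetical helper for illustration."""
    random_hits = total_hits / n_groups      # expected hits per group if mailing at random
    rows = []
    cumulative = 0.0
    for hits in hits_per_group:
        pct = hits / total_hits * 100         # "% of Total" (recall for this group)
        cumulative += pct                     # "Cum % of Total"
        rows.append((hits, round(pct, 2), round(cumulative, 2)))
    return random_hits, rows


# Only the first two groups are given in the text: 81 and 43 hits out of 275.
random_hits, rows = lift_rows([81, 43], total_hits=275)
print(random_hits)   # 27.5
for row in rows:
    print(row)       # (81, 29.45, 29.45) then (43, 15.64, 45.09)
```

The cumulative column is accumulated before rounding, so rounding error in one group does not carry into the next.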
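The table shows that the model's threshold between positive and negative predictions must fall inside the second group. That is why everything in the first group is predicted positive, of which 81 are true positives (TP) and 735 are false positives (FP); the second group contains TP, FN, TN and FP together; and from the third group on everything is predicted negative (it lies below the threshold), so those groups contain only FN and TN.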
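True-Positives (TP): the number of actual responders correctly predicted as responders
False-Negatives (FN): the number of actual responders incorrectly predicted as non-responders
True-Negatives (TN): the number of actual non-responders correctly predicted as non-responders
False-Positives (FP): the number of actual non-responders incorrectly predicted as responders

Analysis of the lift table shows that after the fourth decile the incremental lift (the "% of Total" in the eighth column) falls below random expectation (10 percent per decile), while the first four deciles contain more than 70 percent of the responders. The point where incremental lift crosses random expectation is clearly visible in the incremental lift curve below (Figure 1).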
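The incremental lift curve plots the number of hits in each decile. In Figure 1, the curve crosses the random expectation line (10 percent of 275 responders, i.e., an average of 27.5 responders per decile) after the fourth decile. However marketing managers prefer to look at results, the table and the chart above convey the information they need. Marketers can use these model evaluation tools to decide how many customers to mail. Taking Table 1 as an example, a marketer could mail to the customers in the top four deciles (40 percent of the scored list) and expect to reach 70 percent of the potential responders.

Now that we understand how to evaluate data mining models, we can look more closely at the business processes and adjust them, using model results to raise corporate profitability. These business processes include: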
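The Data Mining Process
In "How to Buy Data Mining: A Framework for Avoiding Costly Project Pitfalls in Predictive Analytics" (DM Review, October 2005), Eric King argues that data mining is a journey, not a destination. He defines this journey as the data mining process. The process contains the following elements: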
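Process Models
Many data mining tool vendors have simplified this process to make it clearer. SAS divides the data mining process into five stages (SEMMA): Sample, Explore, Modify, Model, Assess.

In the past, a recirculating water fountain was often used as a metaphor for the data mining process. Water (data) first wells up onto the top tier (a phase of analysis) and forms eddies (refinement and feedback); once enough "processed" water has accumulated, it spills over to the next tier down. The "processing" continues until the water reaches the lowest tier, where it is pumped back to the top to begin a new round. Data mining closely resembles this tiered, iterative process. Even the internal processing of many data mining algorithms works this way: a neural network, for example, makes many passes (epochs) over the data set until it finds the best solution. Insightful Miner has built a simple process model into its user interface. This integration helps users organize the necessary data mining tasks so that they are carried out in the right order.

The water fountain metaphor is still not quite right, however, because it does not reflect the feedback loops that are common in the data mining process. For example, data assessment may uncover anomalies that require extracting more data from the source systems. Or, after modeling, it may turn out that more records are needed to represent the distribution of the population. The CRISP process model, created jointly by Daimler-Benz, ISL (the developer of Clementine) and NCR, was an attempt to solve this problem. CRISP is also built into the design of the Clementine mining tool (now owned by SPSS). CRISP comes close to reflecting the complete data mining environment.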
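Modeling with data is much like modeling with clay or marble. The artist starts with a lump of material, and only after many rounds of working and inspection does the final piece emerge. Many people do not fully understand the nature of what they are modeling until they are well into the process, and the problems this causes make modeling complicated.

Eric King observes that data mining is a circular process (like the CRISP process diagram above) rather than a linear one. This circular process may remind you of the Wankel rotary automobile engine. That engine turns round and round (instead of moving up and down), continuously delivering kinetic energy to drive the car. Likewise, the data mining process goes round and round, producing information that helps us meet business goals. Information is the "energy" that drives the business. The process contains many feedbacks to earlier stages (for example, more data may be needed after preliminary modeling). One element is still missing from the CRISP process, though: feedback to the data warehouse or source systems. Results from previous CRM campaigns should be loaded into the data warehouse to guide subsequent modeling and to track trends across campaigns. I have added these feedbacks to the CRISP process diagram as red lines (see Figure 2).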
7/2/2007 FW: Data Mining Tools: Which One is Best for CRM? (Part 1)
http://www.dmreview.com/article_sub.cfm?articleId=1046025
Article published in DM Direct Special Report

It has been several years since I ventured forth to answer the question, "How to Choose a Data Mining Tool Suite." That article was organized around two central concepts:
Major Data Mining Tasks
In the past, data mining tool development has focused primarily on providing powerful analytical algorithms. However, the analytical "engines" handle only a small part of the complete task load in a data mining project. As most data miners know, 70 to 90 percent of a data mining project is consumed with data preparation. Development of tools for data preparation has taken the backseat in most data mining tool evolution. Finally, you must be able to evaluate models properly in order to compare them and recommend them to marketing staff.

Data Preparation Tasks
Common data preparation tasks include:
Most data mining tool sets only "minor" on these important data mining tasks. This evaluation will "major" on the ability of common data mining tools to facilitate these tasks. In addition to providing tools for the important tasks of preparing data for modeling, a good data mining tool for direct marketing should include tools for evaluating the models created by the modeling exercise.

Model Evaluation Tools
In analytical theory, the best model is one that has the greatest accuracy in predicting all classification states of the target variable and is acceptably robust in its ability to perform well on the validation data set. That means we must consider the combined accuracy of predicting responders and non-responders. This approach is called the Global Accuracy method. Most data mining tools use this method to identify the "best" model. However, there is a "fly" in this ointment. Embedded in the theory behind the Global Accuracy evaluation method is the assumption that the costs of all types of classification errors are the same. This approach works well in the classroom, but it does not work well in CRM data mining operations, particularly those that drive direct mail (DM) campaigns. In fact, this is one of the major reasons why many CRM initiatives to support DM campaigns have failed to produce much business value in the past. Models have been evaluated on a basis that is only partly relevant to the two things marketers care about: maximizing positive customer response and minimizing the cost of doing so. Most data mining tools focus on the combined accuracy of prediction but ignore the cost element entirely.
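To see why equal-cost accuracy can mislead, here is a minimal Python sketch that scores two models both ways. Everything numeric in it is a hypothetical assumption made for illustration: the confusion counts for `model_a` and `model_b`, the cost of 1 per wasted mail piece, and the cost of 100 per missed responder. None of these figures come from the article.

```python
def accuracy(tp, fn, tn, fp):
    """Global accuracy: share of all prospects classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def total_cost(fn, fp, cost_fn, cost_fp):
    """Cost-weighted error: missed responders and wasted mailings priced separately."""
    return fn * cost_fn + fp * cost_fp


# Hypothetical confusion counts for two models scored on the same
# 10,000 prospects, 275 of whom are actual responders.
model_a = {"tp": 50, "fn": 225, "tn": 9700, "fp": 25}     # cautious: mails very few
model_b = {"tp": 200, "fn": 75, "tn": 8725, "fp": 1000}   # mails far more prospects

# Hypothetical unit costs: a wasted mail piece vs. a lost customer.
COST_FP = 1.0
COST_FN = 100.0

for name, m in (("A", model_a), ("B", model_b)):
    acc = accuracy(**m)
    cost = total_cost(m["fn"], m["fp"], COST_FN, COST_FP)
    print(name, acc, cost)   # A 0.975 22525.0, then B 0.8925 8500.0
```

Under these assumed costs, model A wins on global accuracy while model B is far cheaper overall, which is the mismatch the paragraph above describes.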
In DM campaigns, the cost of mailing to a prospect who does not respond (referred to as a "false-positive" error) is rather small; but the potential cost of not mailing to a prospect who would have responded (a "false-negative" error) can be rather large (reflected in the lifetime value of membership fees not paid and other services not purchased). This means that DM model evaluation methods should focus on minimizing the false-negative errors, rather than the false-positive errors. Because marketers care only about response rates and costs, a mailing to the top three deciles that hits 60 percent of the responders is likely to satisfy both concerns. Mailing to the non-responders (false-positive errors) in the top three deciles is an acceptable cost to the direct marketer for the sake of contacting 60 percent of the total responders available in the target area. This situation represents a 100 percent lift over random expectation (60 percent of responders reached with 30 percent of the mailings) and is much more cost-effective than a mass mailing approach.

Most data mining tools employ the global accuracy method for model evaluation. You may be forced to accept this method to identify the "best" model using the tool's reporting capabilities. The best of several models built with different algorithms should not be chosen by comparing the accuracy reports of each tool. Rather, evaluation should focus on how well the model clusters the positive responders in the top deciles of a scored list sorted on the prediction probability. Even classification algorithms can output classification probabilities. The actual classification (e.g., 0 or 1) is a highly summarized expression of the classification probability (e.g., <0.5 = 0; ≥ 0.5 = 1). Here lies a lot of the true "gold" hidden in the capability set of the tool. The naive CRM data miner will focus on the classification and the accuracy thereof, but the true "gold" of CRM data mining must be expressed in terms of probabilities for retention, purchase and new customer acquisition.
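The difference between a hard 0/1 classification and a probability-sorted list can be sketched in a few lines of Python. The ten scored prospects below are hypothetical, chosen so that the top 30 percent holds 60 percent of the responders as in the example above; the function names `classify` and `capture_and_lift` are also assumptions made for illustration.

```python
def classify(prob, threshold=0.5):
    """Collapse a probability into a hard class label (>= threshold -> 1)."""
    return 1 if prob >= threshold else 0


def capture_and_lift(scored, top_n):
    """Sort (probability, actual) pairs by probability, take the top_n rows,
    and return (share of all responders captured, lift over random)."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total = sum(actual for _, actual in ranked)
    found = sum(actual for _, actual in ranked[:top_n])
    capture = found / total
    # Random mailing to top_n of len(ranked) prospects would capture the same
    # fraction of responders, so lift = capture / (top_n / len(ranked)).
    lift = found * len(ranked) / (total * top_n)
    return capture, lift


# Hypothetical scored list: (predicted response probability, actual response).
scored = [
    (0.45, 1), (0.40, 1), (0.35, 1), (0.30, 0), (0.25, 0),
    (0.20, 1), (0.15, 0), (0.10, 0), (0.08, 1), (0.05, 0),
]

labels = [classify(p) for p, _ in scored]
print(labels)                                  # all 0: no probability reaches 0.5
capture, lift = capture_and_lift(scored, top_n=3)
print(capture, lift)                           # 0.6 2.0
```

With response rates this low, the 0.5 threshold labels every prospect a non-responder, yet the ranking still concentrates responders at the top of the list; a lift of 2.0 corresponds to the "100 percent lift over random expectation" mentioned above.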
A cumulative lift table (e.g., Table 1) must be inspected to determine how effective the model is in clustering true-positives in the upper deciles. This table can be created by:
Table 1: Lift Table with Coincidence Counts

True-Positives (TP): the number of actual responders correctly predicted as responders
False-Negatives (FN): the number of actual responders incorrectly predicted as non-responders
True-Negatives (TN): the number of actual non-responders correctly predicted as non-responders
False-Positives (FP): the number of actual non-responders incorrectly predicted as responders

The analysis of the lift table shows that the incremental lift (percentage of total in the eighth column) declines below the random expectation (10 percent per decile) after the fourth decile; the first four deciles together contain over 70 percent of the total responders. This crossover to negative lift can be seen graphically in an incremental lift curve (Figure 1) below.
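One plausible way to build a decile lift table from a scored list, and to find where incremental lift drops to random expectation, is sketched below in Python. This is an assumption about the procedure, not the author's method, and the data are synthetic: 100 hypothetical prospects with an invented hit pattern, not the figures from Table 1. The names `lift_table` and `last_group_above_random` are illustrative.

```python
def lift_table(scored, n_groups=10):
    """Build a lift table from (probability, actual) pairs.
    Assumes len(scored) is divisible by n_groups."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_hits = sum(actual for _, actual in ranked)
    size = len(ranked) // n_groups
    rows, cumulative = [], 0.0
    for g in range(n_groups):
        chunk = ranked[g * size:(g + 1) * size]
        hits = sum(actual for _, actual in chunk)
        pct = hits * 100 / total_hits          # incremental "% of total"
        cumulative += pct                      # cumulative "% of total"
        rows.append({"decile": g + 1, "hits": hits,
                     "pct_of_total": pct, "cum_pct": cumulative})
    return rows


def last_group_above_random(rows):
    """Return the last leading group whose % of total exceeds random expectation."""
    expected = 100 / len(rows)                 # e.g. 10 percent per decile
    depth = 0
    for row in rows:
        if row["pct_of_total"] <= expected:
            break
        depth = row["decile"]
    return depth


# Synthetic data: 100 prospects, 10 per decile, with an invented number of
# responders in each decile (20 responders in total).
hits_pattern = [6, 4, 3, 2, 1, 1, 1, 1, 1, 0]
scored = []
for i in range(100):
    decile, position = divmod(i, 10)
    actual = 1 if position < hits_pattern[decile] else 0
    scored.append((1 - i / 100, actual))       # strictly decreasing scores
scored.reverse()                               # input order should not matter

rows = lift_table(scored)
depth = last_group_above_random(rows)
print(depth, rows[depth - 1]["cum_pct"])       # 3 65.0
```

In this synthetic case a marketer would mail the top three deciles and reach 65 percent of responders; with the article's Table 1 the same reasoning gives four deciles and over 70 percent.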
The incremental lift curve graphs the number of hits in each decile. In Figure 1, the curve crosses the random expectation line (10 percent of the total of 275 positives = 27.5 per decile) after the fourth decile. Presentation of the results in tabular and graphical forms will communicate the necessary information to marketing managers, no matter how they think. These model evaluation tools can be used by marketers to set the number of customers to mail to. Table 1 shows that a marketer could mail to the top four deciles (40 percent of the total scored list) and expect to hit over 70 percent of the potential responders.

Now that we have a clear understanding of how to evaluate DM models properly, we can look closer at the business processes that must be coordinated by data mining tools that can leverage model results to increase corporate profitability. These business processes include:
The Data Mining Process
In "How to Buy Data Mining: A Framework for Avoiding Costly Project Pitfalls in Predictive Analytics," which appeared in DM Review in October 2005, Eric King maintains that the most important aspect of data mining is the journey, not the destination. He defines this journey as the "process" of data mining. He describes the major elements of this process as:
Process Models
Vendors of several data mining tool packages have simplified the process for the sake of clarity. SAS has collapsed the data mining process into five stages (SEMMA): Sample, Explore, Modify, Model, Assess.

One metaphor that has been used in the past to describe the data mining process is a recirculating water fountain. Water (data) flows onto the first level (phase of analysis), forming eddies (refinements and feedbacks) until enough "processed" water accumulates to spill over to the next lower level. The "processing" continues until it reaches the lowest level, where it is pumped back to the top, and the "process" begins again. Data mining is a lot like this iterative cascading process. Even the internal processing of many data mining algorithms like neural nets is accomplished through many runs (epochs) through the data set until the "best" solution is found. Some tools (e.g., Insightful Miner) have built versions of a simple process model into their user interfaces. Such integration of the data mining process into the tool interface helps the user to organize the necessary data mining tasks in proper processing order.

The problem with the water fountain analogy is that there is no reflection of the feedback loops that often occur in the data mining process. For example, data assessment might uncover some anomalies that require extraction of additional data from source systems. Or, after modeling, it may become apparent that additional data records are needed to adequately represent the parent population. One attempt to address this problem was embodied in the CRISP process model created by a consortium of Daimler-Benz, ISL (developer of Clementine) and NCR. CRISP is an integral part of the Clementine tool design (now owned by SPSS). CRISP comes closest to encompassing the entire data mining context.

Modeling with data is much like modeling with clay or marble.
The artist starts with a lump of material, and with many rounds (iterations) of manipulation and inspection, the art piece gradually reaches its final form. Modeling with data is complicated by the additional problem of not sufficiently knowing the nature of the modeling medium until midway through the modeling process.

Eric King observes (rightly) that the data mining process is circular (as the CRISP process diagram above shows), rather than linear. The operation of the circular data mining process might remind you of the Wankel rotary automobile engine. The engine goes round and round (instead of up and down), pumping out kinetic energy in the form of rotary motion used to move the car. Likewise, the data mining process goes round and round and pumps out information that can be used to accomplish business goals. This information is the "energy" used to fuel business. There are many feedbacks to previous stages in the process (e.g., acquisition of additional data after preliminary modeling is done). There is one element missing in the CRISP process, though: feedback to the data warehouse or source data systems. Results from previous CRM campaigns should be entered into the data warehouse to provide insights for subsequent modeling operations and permit tracking of trends across campaigns. These feedbacks are superimposed on the CRISP process as dotted lines (Figure 2).
Robert A. Nisbet, Ph.D., is an independent data mining consultant with more than 35 years of experience in analysis and modeling in science and business. You can contact him at Bob@rnisbet.com or (805) 685-0053.

6/25/2007 Let's make this a bilingual blog (Bilingual blog of Data Mining)
My other blog, in Chinese, is at http://idmer.blogger.org.cn. (This post was written with Windows Live Writer.)

IDMer (Data Miner): I am a data miner, and you are welcome to discuss anything about data mining!