[Proofreading] No. 45: A five-pronged approach to analyze process data
This post was last edited by 小编D on 2011-12-15 17:24
Make Data Matter
A five-pronged approach to analyze process data
by Ronald D. Snee
Is data analysis an art or a science? Arguments exist for both sides, and many people simply come down in the middle. I believe it’s both.
Regardless of which view you take, the discussion misses a critical element: the need for an explicitly articulated strategy for data analysis. In fact, the various attitudes toward the nature of data analysis often imply unreflective strategies.
Partisans of data analysis as an art might simply look at the data, manipulate it based on their intuition and experience, and proceed confidently to extract what they believe is useful information. The more scientific folks, with perhaps too much faith in numbers, go straight to statistical software and do some indisputable number crunching.
Those who stand on middle ground, possibly the great majority of practitioners, do a little of both: rely on their insight to manipulate the data, run the numbers, do some further manipulation and rerun the numbers until they achieve what they believe is a satisfactory result.
All of those approaches are likely to produce questionable results in terms of what the analysis addresses and the significance of the results.
Five activities
Practitioners can avoid the pitfalls of these unreflective or ad hoc approaches by adopting a clearly articulated, proven strategy for analyzing process data and systematically following that strategy.1 Such a strategy entails five essential activities:
- Understanding the context of the analysis.
- Examining the pedigree of the data.
- Graphically representing the process.
- Graphically representing the data.
- Statistically analyzing the data.
Note that these are iterative, as opposed to sequential, activities. Depending on the circumstances, the order of some of these activities may shift.
For example, in the mutually dependent iterations of this approach, the graphical representation of the process may precede the examination of the pedigree. In any case, most of these activities look forward and backward. The examination of the data’s pedigree—where it came from and how it was collected—may drive the analyst back to a fuller exploration of the context of the process to fill out that pedigree.
But the pedigree of the data also points to how the process should be graphically represented. That, in turn, could retrospectively suggest the need for additional types of data and prospectively affect the graphical representation. By engaging iteratively in these activities, you can arrive at important results that are ready to be fully and persuasively reported.
This approach offers at least three distinct advantages over less structured approaches. First, it is repeatable—it can be used in any situation that calls for the greater understanding of a process. Second, like sound processes themselves, it’s robust—flexible enough to encompass the wide variation of particulars to be found in different situations. Third, and most importantly, it’s more likely to produce useful results.
Understanding the context
It’s difficult to know precisely how to proceed until you ask the most basic of questions: What is the purpose of the analysis? Are you trying to confirm a hypothesis?
For example, a manufacturer that uses raw materials from two different vendors suspects that differences in quality are causing defects in the finished product. Data analysis can confirm or disconfirm the hypothesis and, in this example, identify the offending vendor. Such contexts call for what is sometimes referred to as confirmatory data analysis.
Alternatively, let’s say you’re trying to solve specific problems, the causes of which you do not understand. For example, a chemical process is producing unacceptable variations in purity from batch to batch. Or a business process, like a bank loan approval process, is taking far too long to complete. Or, perhaps a distributor’s percentage of on-time deliveries is fluctuating widely. These contexts call for exploratory data analysis, which must first arrive at a hypothesis to test.
In confirmatory and exploratory analyses of a process, the goal is the same: find the inputs and the controlled and uncontrolled variables that have a major impact on the output of the process.2
Examining the pedigree
Data analysis begins with a data table, which is either provided to or constructed by the analyst. In either case, you should always question the data because data can be, among many other things:
• Incorrect: Some of the information is wrong—for example, when someone monitoring a process records the data incorrectly or a measurement device is faulty.
• Irrelevant: Some of it is the wrong information—for example, when data on the wrong variables are captured.
• Incomplete: Crucial information is missing—for example, when data on an important variable are missing.
• Misleading/biased: Data points you in the wrong direction for analysis—for example, when an important variable has been examined only over a short time, thus making it appear to be a constant.
An understanding of the context of the process can guard against these errors, but the context alone is insufficient. Given these and the many other shortcomings that can undermine the value of the data, it is absolutely critical to understand the pedigree of the data—where it came from and how it was collected.
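Some of the shortcomings listed above can be screened for mechanically before any analysis starts. The following is a minimal sketch in plain Python, not a method from the article: the column names (`purity`, `temp_c`), the plausible ranges and the 1% near-constant threshold are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev

# Hypothetical data table: one dict per batch; None marks a missing value.
rows = [
    {"batch": 1, "purity": 98.2, "temp_c": 71.0},
    {"batch": 2, "purity": 97.9, "temp_c": 71.0},
    {"batch": 3, "purity": 198.1, "temp_c": 71.0},  # likely a recording error
    {"batch": 4, "purity": None, "temp_c": 71.0},   # missing measurement
    {"batch": 5, "purity": 98.4, "temp_c": 71.1},
]

def screen(rows, column, low, high, min_rel_spread=0.01):
    """Return simple warnings for one column of the data table."""
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]
    warnings = []
    if len(present) < len(values):
        warnings.append(f"{column}: {len(values) - len(present)} missing value(s)")
    out_of_range = [v for v in present if not low <= v <= high]
    if out_of_range:
        warnings.append(f"{column}: out-of-range value(s) {out_of_range}")
    # A variable that barely moves may only look constant because it was
    # observed over too short a window (the "misleading" case).
    if present and pstdev(present) < min_rel_spread * abs(mean(present)):
        warnings.append(f"{column}: nearly constant over this sample")
    return warnings

purity_warnings = screen(rows, "purity", low=0.0, high=100.0)
temp_warnings = screen(rows, "temp_c", low=0.0, high=200.0)
for w in purity_warnings + temp_warnings:
    print(w)
```

A screen like this only flags candidates for questioning. Deciding whether a flagged value is really wrong still requires knowing where the data came from.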
For example, consider a batch manufacturing process in which a sample is taken every shift and carried to an analytical lab where it is tested for purity, and the results are recorded. Thus, the data trail is:
Production process ► sampling process ► testing process ► data-logging process.
To understand the resulting data, it is necessary to understand this data trail and the production process parameters. That is the pedigree of the data.
Incomplete understanding of the data’s pedigree can lead you down wrong analytical trails. Suppose, for example, a pharmaceutical company is experiencing differences in yield from batch to batch of a product because of the properties of the raw materials supplied by a vendor. Although the properties for each batch of raw materials are within specifications, the yield nevertheless varies unacceptably.
The analyst has been given a data table that includes the properties of the raw materials for each batch of product under consideration. But if the analyst does not know that some raw material batches were analyzed by the vendor’s quality assurance lab and some by the manufacturer, then there is a strong possibility the analysis will come up empty. By taking the time to understand the pedigree of the data fully, the analyst can save much frustration and fruitless work.
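The two-lab situation can be made concrete with a small sketch. It assumes the data table carries a field recording which lab measured each raw material lot. The field name `lab`, the property `moisture` and all values are hypothetical. If that field is missing from the table, the comparison below cannot be made at all, which is the point of the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw-material records; "lab" is the pedigree field that is
# easy to lose when a data table is handed over without explanation.
records = [
    {"lot": "R1", "lab": "vendor", "moisture": 1.9},
    {"lot": "R2", "lab": "vendor", "moisture": 2.1},
    {"lot": "R3", "lab": "vendor", "moisture": 2.0},
    {"lot": "R4", "lab": "manufacturer", "moisture": 2.4},
    {"lot": "R5", "lab": "manufacturer", "moisture": 2.6},
    {"lot": "R6", "lab": "manufacturer", "moisture": 2.5},
]

def mean_by(records, key, value):
    """Average one measured value within each level of a grouping field."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r[value])
    return {k: mean(v) for k, v in groups.items()}

by_lab = mean_by(records, "lab", "moisture")
offset = by_lab["manufacturer"] - by_lab["vendor"]
print(f"mean by lab: {by_lab}, offset: {offset:.2f}")
```

An offset between labs does not by itself say which lab is right, or whether the lots really differ. It says that lab identity has to be carried into the analysis.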
Some Guiding Principles
• The process provides the context for the problem being studied and the data being analyzed.
• Know the pedigree of the data—the who, what, when, where, why and how of its collection.
• Analysis is defined by how the data were generated.
• Understand the measurement system as well as the process.
• Be aware of human intervention in a process. Humans are often a large source of variation.
Graphing the process
A graphical representation of the process shows how the process works from end to end. Such representations fall into two broad categories: flow charts and schematics. A flow chart maps the sequence and flow of the process and often includes icons, such as pictures of a truck to represent a transportation step or smokestacks to indicate a factory.
A schematic representation is designed to exhibit the inputs and the controlled and uncontrolled variables that go into a process to produce its outputs. Both types of representation reinforce one another by suggesting what types of data are needed, where they can be found and how they can be analyzed.
Figure 1 is an elementary schematic representation of a process (such as pharmaceutical, chemical or loan approval). As the analyst knows, the context is unacceptable variations in yield from batch to batch of the finished product. Therefore, “yield” is the key output.
To get an accurate picture of the process, however, analysts should not simply rely on the context. To find out how the process really works, they should also observe the process first-hand and question the people who operate it. This investigation might also lead the analyst to further refine the pedigree of the data: the who, when and why of its measurement and collection.
With yield as the key output of a manufacturing process, the analyst can now graphically represent the process and fill in the blanks with the sources of possible variation that led to the unacceptable variations in yield. For the inputs, sources of variation might be energy, raw materials and different lots of the same raw materials. Controlled variables that go into the process might include things like temperature, speed of flow and mixing time.
In essence, controlled variables are the things that can be adjusted with a knob or a dial. Uncontrolled variables that go into this process may include human intervention and differences in work teams, production lots, days of the week, machines or even heads on the same machine. In the output of the process, variation may result from the measurement system itself.
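The schematic view translates naturally into a small data structure that can be checked against the data table. The sketch below is one possible encoding, not the article's Figure 1: the class name `ProcessSchematic`, the variable names and the table columns are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessSchematic:
    """Candidate sources of variation, grouped as in the schematic view."""
    name: str
    inputs: list = field(default_factory=list)
    controlled: list = field(default_factory=list)
    uncontrolled: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

    def missing_columns(self, table_columns):
        """Variables named in the schematic that the data table does not record."""
        wanted = self.inputs + self.controlled + self.uncontrolled + self.outputs
        return [v for v in wanted if v not in table_columns]

schematic = ProcessSchematic(
    name="batch manufacturing",
    inputs=["energy", "raw_material", "raw_material_lot"],
    controlled=["temperature", "flow_speed", "mixing_time"],
    uncontrolled=["work_team", "production_lot", "day_of_week", "machine"],
    outputs=["yield"],
)

# Hypothetical column names of the data table the analyst was handed.
table_columns = {"raw_material_lot", "temperature", "mixing_time", "yield"}
gaps = schematic.missing_columns(table_columns)
print(gaps)
```

Listing the gaps this way turns the schematic into a checklist of data that may still need to be collected, such as which work team ran each batch.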
A good rule to follow when you have, for example, two production lines doing the same thing or two pieces of equipment performing the same task, is to assume they vary until proven otherwise. That’s especially true for the human factor. Experience shows that in creating the initial data table and in the graphical representation of the process, the human element is a frequently overlooked source of variation.
In the aforementioned pharmaceutical manufacturing process, the analyst may overlook that the process includes three shifts with four different work teams on the shifts.
As a result of the observation and investigation that goes into constructing the graphical representation of the process, however, the analyst makes sure the data table records which team produced which batches on which days and that the data are stratified in the analysis. The failure to take that human element into account results in a highly misleading data table and might obscure the ultimate solution to the problem.
Graphing the data
The graphical representation of the process, and the understanding of the possible sources of variation it helps generate, suggests ways in which the analyst can graphically represent the data. Because data are almost always sequential, a run chart is often needed. In our example, the x-axis would register time and the y-axis would register yield.
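A run chart is usually read against a center line. The sketch below computes the median as that center line and the longest run of consecutive points on one side of it, a common informal signal of a shift in the process. The yield values are hypothetical, and the plotting itself is left to whatever charting tool is at hand.

```python
from statistics import median

# Hypothetical yields (%) for consecutive batches, in time order.
yields = [88.1, 87.6, 88.4, 87.9, 88.0, 87.7,
          85.2, 85.0, 84.7, 85.5, 84.9, 85.1]

def longest_run_one_side(values):
    """Return the median and the longest run of points on one side of it.

    Points exactly on the median are skipped, a common run-chart convention.
    """
    center = median(values)
    longest = current = 0
    previous_side = None
    for v in values:
        if v == center:
            continue
        side = v > center
        current = current + 1 if side == previous_side else 1
        previous_side = side
        longest = max(longest, current)
    return center, longest

center, longest = longest_run_one_side(yields)
print(f"center line: {center:.2f}, longest run on one side: {longest}")
```

In this made-up series the first six batches sit above the median and the last six below it, the kind of pattern that sends the analyst back to the pedigree to ask what changed between batches six and seven.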
A scatter plot also may be used, with process variables registered on the x-axis and process outputs registered on the y-axis. Other familiar graphical techniques include box plots, histograms, dot plots and Pareto charts.
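Of the techniques listed, the Pareto chart is the simplest to sketch without a plotting library, because the chart is just a sorted count table with a cumulative percentage. The defect categories below are hypothetical.

```python
from collections import Counter

# Hypothetical defect log: one entry per rejected batch.
defect_log = ["low yield", "contamination", "low yield", "label error",
              "low yield", "contamination", "low yield", "off-color"]

def pareto_table(observations):
    """Return (category, count, cumulative percent) rows, largest count first."""
    counts = Counter(observations).most_common()
    total = sum(n for _, n in counts)
    rows, running = [], 0
    for category, n in counts:
        running += n
        rows.append((category, n, 100.0 * running / total))
    return rows

table = pareto_table(defect_log)
for category, n, cum_pct in table:
    print(f"{category:<14}{n:>3}{cum_pct:>8.1f}%")
```

The bars of a Pareto chart are the counts in this order, and the overlaid line is the cumulative percentage column.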
In using any of these techniques, the goal is to make sure you are exploring the relationships of potentially important variables and preparing an appropriate graphical representation for purposes of statistical analysis. Plotting the data in different ways can lead to insights and surprises about the sources of variation.
Statistically analyzing the data
The statistical analysis of the data, usually with the aid of statistical software, establishes what factors are statistically significant. For example, are the differences in yield produced by different work teams statistically significant? What about variations in temperature or flow? What about the measurement system itself?
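The article does not say which test to use for the work-team question. As one illustration, the sketch below uses a permutation test: it shuffles the team labels many times and asks how often chance alone produces team-to-team differences at least as large as those observed. This is a stand-in chosen because it needs only the standard library. Statistical software would more typically report an analysis of variance. The yields, team names, shuffle count and seed are hypothetical.

```python
import random
from statistics import mean

# Hypothetical yields (%) by work team.
yields_by_team = {
    "A": [88.1, 87.6, 88.4, 87.9],
    "B": [85.2, 85.0, 84.7, 85.5],
    "C": [87.8, 88.2, 87.5, 88.0],
}

def between_group_spread(groups):
    """Size-weighted squared deviations of group means from the grand mean."""
    grand = mean(v for g in groups for v in g)
    return sum(len(g) * (mean(g) - grand) ** 2 for g in groups)

def permutation_p_value(groups, n_shuffles=2000, seed=0):
    """Share of random relabelings whose spread is at least the observed one."""
    rng = random.Random(seed)
    observed = between_group_spread(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if between_group_spread(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)

p_value = permutation_p_value(list(yields_by_team.values()))
print(f"permutation p-value is roughly {p_value:.4f}")
```

A small p-value says the team differences are unlikely to be chance. It does not say why the teams differ, which is a question for the process and its pedigree.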
The key to success lies in intimately knowing the data from the context of the process, graphically representing it and formulating a model that includes the comparisons, relationships, tests and fits you are going to study.
Once you have created the graphics and done the statistical calculations, the results should be checked against the model. Does it account for all of the variation? In short, do the results make sense? If so, you can confidently report your results.3
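One way to put a number on how much of the variation a model accounts for is the fraction of variation explained by a fitted model, often written R². The sketch below assumes the simplest possible model, a straight line relating yield to temperature, fitted by ordinary least squares. The data are hypothetical, and a real analysis would usually involve more variables and a look at the residuals as well.

```python
from statistics import mean

# Hypothetical paired observations: process temperature (deg C) and yield (%).
temperature = [70.0, 71.0, 72.0, 73.0, 74.0, 75.0]
yield_pct = [84.9, 85.6, 86.8, 87.1, 88.3, 88.9]

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def r_squared(x, y, slope, intercept):
    """Fraction of the variation in y that the fitted line accounts for."""
    y_bar = mean(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_resid / ss_total

slope, intercept = fit_line(temperature, yield_pct)
r2 = r_squared(temperature, yield_pct, slope, intercept)
print(f"slope: {slope:.3f} per deg C, R^2: {r2:.3f}")
```

A value well below 1 suggests that important sources of variation are still missing from the model, which points back to the schematic and the data's pedigree.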
Beyond analysis to action
The final point about reporting the results offers a reminder that analysis goes beyond the exploratory or confirmatory. The analyst also must be able to display and communicate results to decision makers. The most elegant analysis possible is wasted if it fails to communicate and the organization therefore fails to act.
Early in my career, I was asked to analyze whether a chemical company’s new product had adversely affected animals in safety studies. Personnel in the company’s lab insisted the data from the experiments showed adverse effects, and the company should therefore cease development of the product. Analysts on the company’s business side had concluded the data showed no adverse effects. My analysis reached the same conclusion, and in a showdown meeting between the business and the lab personnel, I presented my findings.
At the conclusion of my presentation, replete with analytical representations of the statistical significance of the data, the lab director remained unconvinced. So I handed him one final graph: a dot plot that, for some reason, I had not included in my presentation.
He looked at the graph and began to think aloud while everyone in the meeting sat silently. He continued to look and talk and look and talk. At last, he said emphatically, “Maybe there isn’t a difference.”
In the absence of that persuasive graphical representation and model of the data, the company might have ceased production of what turned out to be a valuable and harmless product. The bottom line is that the analyst must not only do data analysis that matters, but also make it matter.
©Ronald D. Snee, 2008.