What Is a Scatter Plot?
A scatter plot is a type of chart that displays the relationship between two continuous variables. In quantitative analysis, and in data visualization more broadly, scatter plots are used to observe patterns, trends, and correlations in data [60, 61, 62, 63]. Each point on a scatter plot represents a pair of values, with one variable plotted on the horizontal axis (x-axis) and the other on the vertical axis (y-axis) [57, 58, 59].
This visual representation allows analysts to quickly discern whether a positive, negative, or no relationship exists between the variables, and also helps in identifying potential outliers [54, 55, 56]. Scatter plots are sometimes referred to as scattergrams, scatter graphs, or scatter charts [53].
History and Origin
The concept of plotting data points on a Cartesian coordinate system, which forms the basis of the scatter plot, dates back to René Descartes in the 17th century [52]. While the precise origin of the scatter plot is not definitively credited to a single inventor, the English scientist John Frederick W. Herschel is thought to have created what may be the earliest known scatter plot in 1833 for a study on the orbits of double stars [49, 50, 51].
The scatter plot gained significant importance and widespread adoption through the work of Francis Galton in the 1870s and 1880s [46, 47, 48]. Galton, a pioneer of modern statistics, was particularly interested in heredity and used scatter plots to understand relationships such as the correlation between the heights of children and their parents [42, 43, 44, 45]. His work with these diagrams led to the development of the statistical concepts of correlation and regression, which are foundational to much of modern statistics [39, 40, 41]. The term "scatterplot" itself has been credited to the mathematician Karl Pearson, in 1906 [38].
Key Takeaways
- A scatter plot visually represents the relationship between two continuous variables.
- Each data point on the plot corresponds to a pair of values for the two variables.
- The arrangement of points helps identify positive, negative, or no correlation.
- They are valuable for detecting patterns, trends, and outliers in datasets.
- Scatter plots are a fundamental tool in statistical analysis and data exploration.
Formula and Calculation
Creating a scatter plot does not require a mathematical formula, because the chart is a direct graphical representation of raw data points. However, it is often the first step in visualizing data before applying statistical formulas such as those for the correlation coefficient or linear regression.
The coordinates for each point plotted on a scatter plot come directly from the observed data pairs. For a dataset with n observations of two variables, X and Y, each observation i has a corresponding ordered pair (x_i, y_i) that is plotted as a single point on the graph.
For example, if analyzing the relationship between advertising spend (X) and sales revenue (Y), each data point on the scatter plot would represent a specific period's advertising spend and the corresponding sales revenue.
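As a minimal sketch of how the ordered pairs are formed, the snippet below zips two lists into points and then computes the Pearson correlation coefficient from its textbook definition. The advertising and sales figures are invented for illustration, and the function name `pearson_r` is an assumption of this sketch rather than something defined in the text.

```python
from math import sqrt

# Hypothetical figures (in thousands USD), invented for illustration only.
ad_spend = [20, 25, 30, 35, 40]
sales = [210, 240, 275, 290, 330]

# Each observation i becomes one ordered pair (x_i, y_i), i.e. one plotted point.
points = list(zip(ad_spend, sales))


def pearson_r(xs, ys):
    """Pearson correlation: co-variation scaled by the spread of each variable."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


r = pearson_r(ad_spend, sales)
print(points[0], round(r, 3))
```

The plot itself needs only `points`; the coefficient is a follow-up calculation that puts a number on what the plot shows.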
Interpreting the Scatter Plot
Interpreting a scatter plot primarily involves observing the overall pattern, direction, and strength of the relationship between the two variables.
- Direction:
  - Positive Correlation: If the points tend to rise from the lower left to the upper right, it suggests a positive relationship. As the value of the x-variable increases, the value of the y-variable also tends to increase. For example, a scatter plot of study hours versus exam scores might show a positive correlation [36, 37].
  - Negative Correlation: If the points tend to fall from the upper left to the lower right, it indicates a negative relationship. As the x-variable increases, the y-variable tends to decrease.
  - No Correlation: If the points appear randomly scattered with no discernible pattern, there is little to no linear relationship between the variables [35].
- Strength: The closer the points cluster around a hypothetical line or curve, the stronger the relationship between the variables. If the points are widely dispersed, the relationship is weaker [33, 34].
- Form: The relationship can be linear (forming a straight line) or non-linear (forming a curve).
- Outliers: Individual points that fall far away from the general cluster of points are considered outliers [30, 31, 32]. These points can significantly influence statistical calculations like regression analysis and warrant further investigation.

Understanding these elements is crucial for drawing meaningful insights from financial data.
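The visual checks described above can be approximated numerically. The sketch below is one possible approach, not a standard procedure: it labels direction and strength from the correlation coefficient and flags points that sit far from a least-squares line. The cutoffs (0.3 and 0.7), the residual threshold `k`, the function names, and the study-hours data are all assumptions chosen for illustration.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares slope, intercept, and Pearson r for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / sqrt(sxx * syy)


def describe(r, weak=0.3, strong=0.7):
    """Turn r into a rough label; the cutoffs are a convention, not a rule."""
    if abs(r) < weak:
        return "little or no linear relationship"
    direction = "positive" if r > 0 else "negative"
    strength = "strong" if abs(r) >= strong else "moderate"
    return f"{strength} {direction}"


def flag_outliers(xs, ys, k=2.0):
    """Indices of points whose residual exceeds k times the residual spread."""
    slope, intercept, _ = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = sqrt(sum(e ** 2 for e in residuals) / len(residuals))
    return [i for i, e in enumerate(residuals) if spread > 0 and abs(e) > k * spread]


# Hypothetical study hours and exam scores; the sixth score is deliberately odd.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 30, 79, 84]

slope, intercept, r = fit_line(hours, scores)
print(describe(r), flag_outliers(hours, scores))
```

Note that this only captures linear form. A curved pattern can produce a low coefficient even when the relationship is strong, which is one reason the plot should still be inspected by eye.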
Hypothetical Example
Consider a financial analyst examining the relationship between a company's research and development (R&D) expenditure and its subsequent annual revenue over several years.
Data:
Year | R&D Expenditure (in millions USD) | Annual Revenue (in millions USD) |
---|---|---|
1 | 10 | 150 |
2 | 12 | 170 |
3 | 8 | 130 |
4 | 15 | 190 |
5 | 11 | 160 |
6 | 14 | 185 |
7 | 9 | 145 |
To create a scatter plot, the analyst would plot each year's data as a point:
- The R&D Expenditure would be on the x-axis.
- The Annual Revenue would be on the y-axis.
For Year 1, a point would be plotted at (10, 150). For Year 2, it would be (12, 170), and so on.
Upon plotting these points, the analyst might observe that as R&D expenditure increases, annual revenue also generally increases, suggesting a positive relationship. This visual insight can then lead to further quantitative analysis, such as calculating the correlation coefficient to quantify the strength of this relationship or performing forecasting using regression.
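The follow-up analysis mentioned above can be sketched in a few lines. The code below uses the seven data points from the table, computes the correlation coefficient and a least-squares line by hand, and includes an optional plotting helper. The helper assumes matplotlib is installed; the names `predict_revenue` and `draw` and the output file name are hypothetical choices for this example.

```python
from math import sqrt

# (R&D expenditure, annual revenue) in millions USD, Years 1-7 from the table.
data = [(10, 150), (12, 170), (8, 130), (15, 190), (11, 160), (14, 185), (9, 145)]
rd = [x for x, _ in data]
revenue = [y for _, y in data]

n = len(data)
mean_x, mean_y = sum(rd) / n, sum(revenue) / n
sxx = sum((x - mean_x) ** 2 for x in rd)
syy = sum((y - mean_y) ** 2 for y in revenue)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)

r = sxy / sqrt(sxx * syy)           # strength and direction of the linear relationship
slope = sxy / sxx                   # revenue change associated with +1M of R&D
intercept = mean_y - slope * mean_x


def predict_revenue(rd_spend):
    """Revenue implied by the fitted line; unreliable far outside the 8-15 range."""
    return intercept + slope * rd_spend


def draw(path="rd_vs_revenue.png"):
    """Optional: save the scatter plot with its fitted line (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(rd, revenue)
    xs = [min(rd), max(rd)]
    ax.plot(xs, [predict_revenue(x) for x in xs])
    ax.set_xlabel("R&D expenditure (millions USD)")
    ax.set_ylabel("Annual revenue (millions USD)")
    fig.savefig(path)


print(f"r = {r:.3f}, revenue ~ {intercept:.1f} + {slope:.2f} * R&D")
```

For this table the coefficient comes out at roughly 0.99 and the slope at roughly 8.4, which matches the visual impression of a tight, upward-sloping cloud of points. With only seven observations, the fitted line is a description of this sample rather than a reliable forecast.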
Practical Applications
Scatter plots are widely used across various financial disciplines due to their ability to quickly reveal relationships between variables.
- Investment Analysis: Investors and analysts use scatter plots to assess the relationship between the returns of different assets, or between a single asset and a market index. For instance, plotting a stock's historical returns against the S&P 500's returns can help visualize how closely the stock tracks the market and its beta, which corresponds to the slope of a line fitted through the points.
- Risk Management: They can be employed to visualize the relationship between different risk factors and their impact on portfolio performance, aiding in portfolio diversification strategies.
- Economic Analysis: Economists frequently use scatter plots to study macroeconomic relationships, such as the correlation between interest rates and inflation, or between unemployment rates and GDP growth. For example, the Federal Reserve publishes economic data and uses scatter-style visualizations in its analysis and communications, such as the "dot plot," which shows policymakers' projections for the federal funds rate [26, 27, 28, 29]. Organizations like the OECD (Organisation for Economic Co-operation and Development) also provide extensive datasets that can be visualized using scatter plots to understand economic trends globally [24, 25].
- Accounting: In managerial accounting, the scattergraph method is used to separate fixed and variable components of semi-variable expenses by plotting data points representing activity levels and costs. This helps in cost accounting and budgeting.
- Market Research: Businesses use scatter plots to analyze consumer behavior, such as the relationship between product pricing and sales volume [23]. This informs marketing strategies and pricing decisions.
- Compliance and Regulation: Regulators might use scatter plots to identify unusual patterns or potential instances of market manipulation by analyzing trading volumes against price movements. News organizations like Reuters Graphics also employ diverse charts and visualizations to contextualize complex financial and economic information [22].
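To make the investment-analysis case concrete, the sketch below estimates a stock's beta as the slope of its returns plotted against market returns, which equals the covariance of the two series divided by the variance of the market series. The return figures are invented, and the function name `estimate_beta` is an assumption of this example.

```python
def estimate_beta(asset_returns, market_returns):
    """Slope of asset returns on market returns: cov(asset, market) / var(market)."""
    n = len(market_returns)
    if n != len(asset_returns) or n < 2:
        raise ValueError("need two equal-length return series with at least 2 periods")
    mean_a = sum(asset_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(asset_returns, market_returns)) / (n - 1)
    var = sum((m - mean_m) ** 2 for m in market_returns) / (n - 1)
    if var == 0:
        raise ValueError("market returns are constant; beta is undefined")
    return cov / var


# Hypothetical monthly returns, invented for illustration only.
market = [0.010, -0.020, 0.030, 0.015, -0.010, 0.025]
stock = [0.018, -0.031, 0.041, 0.020, -0.022, 0.039]

beta = estimate_beta(stock, market)
print(round(beta, 2))
```

A beta above 1 means the points slope more steeply than a 45-degree line, so the stock tends to move more than the market in either direction. How tightly the points hug that line is a separate question, answered by the correlation rather than by the slope.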
Limitations and Criticisms
Despite their utility, scatter plots have certain limitations:
- Overplotting: When a large number of data points are plotted in a small area, especially with discrete variables or many identical values, points can overlap, making it difficult to discern the true density or distribution of data. This phenomenon, known as overplotting or overdraw, can obscure outliers and hide underlying data patterns [20, 21].
- Dimensionality: Scatter plots are best suited for visualizing the relationship between two variables. While techniques like adding color, size, or shape to points can introduce a third or fourth variable (creating a bubble chart), adding too many dimensions can make the plot difficult to interpret.
- Causation vs. Correlation: A strong correlation observed in a scatter plot does not necessarily imply causation. There might be a third variable influencing both, or the relationship could be coincidental. Users must exercise caution in inferring cause-and-effect solely from a scatter plot.
- Subjectivity: Interpreting the strength and form of a relationship can sometimes be subjective, especially when the correlation is weak or the pattern is not clearly linear. This can lead to different interpretations by different observers.
- Data Scale: The appearance of a scatter plot and the perceived strength of a relationship can be influenced by the scale chosen for the axes. Manipulating the axis ranges can make a weak correlation appear stronger or vice versa, potentially misleading the viewer [19].
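Overplotting in particular can be detected before drawing anything. The sketch below counts how many observations share each coordinate and scales a marker size by that count, which is one common remedy alongside transparency and jitter. The survey-style data, the function names, and the base marker size are assumptions made for this example.

```python
from collections import Counter


def coincident_counts(points):
    """Number of observations at each distinct (x, y) coordinate."""
    return Counter(points)


def marker_sizes(points, base=20.0):
    """Marker size per distinct point, proportional to how many observations it hides."""
    return {pt: base * count for pt, count in coincident_counts(points).items()}


def overlap_share(points):
    """Fraction of observations that would be hidden behind another marker."""
    if not points:
        return 0.0
    return 1 - len(set(points)) / len(points)


# Hypothetical 1-5 ratings on two survey questions; discrete values overlap heavily.
ratings = [(3, 4), (3, 4), (3, 4), (2, 2), (5, 5), (5, 5), (1, 3), (4, 4)]

counts = coincident_counts(ratings)
print(counts[(3, 4)], round(overlap_share(ratings), 3))
```

Here eight observations collapse onto five visible positions, so a plain scatter plot would understate how concentrated the responses are around (3, 4).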
Scatter Plot vs. Line Graph
While both scatter plots and line graphs are powerful tools for data visualization, they serve different primary purposes and are best suited for different types of data.
A scatter plot is designed to show the relationship between two continuous variables for individual data points [16, 17, 18]. Each point is independent of the others in terms of its position relative to the x-axis, allowing for the observation of patterns, clusters, and outliers that indicate correlation [14, 15]. There is no implied sequence or connection between the plotted points other than their bivariate relationship.
In contrast, a line graph is primarily used to display the change in a variable over a continuous interval, typically time [11, 12, 13]. Data points are connected by lines, emphasizing trends, rates of change, and historical progression [9, 10]. Each point on a line graph is sequentially linked to the next, making it ideal for visualizing time series data, such as stock prices over a period or economic indicators over quarters [8]. Unlike a scatter plot where the relationship between two variables is central, a line graph focuses on how a single variable evolves over time, or how multiple variables compare in their temporal evolution.
FAQs
What type of data is best suited for a scatter plot?
Scatter plots are best suited for visualizing the relationship between two quantitative (numerical) variables. They are particularly useful when you want to see if there's a correlation or pattern between how two different things change together.
Can a scatter plot show more than two variables?
While a basic scatter plot shows two variables (x and y), you can visually incorporate additional variables by using different colors, shapes, or sizes for the data points. For instance, larger points could represent a higher value of a third variable, or different colors could denote different categories, effectively transforming it into a multivariate chart.
What does it mean if points on a scatter plot form a straight line?
If points on a scatter plot closely form a straight line, it indicates a strong linear relationship between the two variables. If the line goes up from left to right, it's a strong positive correlation; if it goes down, it's a strong negative correlation. The tighter the points are to the line, the stronger the relationship [7].
How do I identify outliers on a scatter plot?
Outliers on a scatter plot are data points that lie far away from the main cluster of points or the general trend observed. They deviate significantly from the pattern established by the majority of the data [4, 5, 6]. Identifying them often requires visual inspection and can be confirmed with statistical methods.
What is the purpose of adding a trend line to a scatter plot?
A trend line, also known as a line of best fit or regression line, is often added to a scatter plot to visually represent the overall pattern or trend in the data [1, 2, 3]. It helps to summarize the relationship between the variables and can be used for prediction or to highlight the direction and strength of the correlation more clearly.