Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions. At the heart of statistics lies the process of data collection. Without accurate and relevant data, any analysis or conclusion would be unreliable. Therefore, understanding how data is gathered and classified is fundamental.
Imagine you want to know the average height of students in your college. Before you can calculate this average, you need to collect height measurements. The way you collect this data and the source of the data determine its type and reliability.
In this chapter, we will explore the two main types of data, primary and secondary, and understand their roles in statistics. We will also learn how data is organized, classified, and represented to extract meaningful information.
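Let's begin by defining the two fundamental types of data:
- Primary data: data collected firsthand by the researcher for a specific purpose.
- Secondary data: data that already exists, collected by others for a different purpose.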
Understanding the difference is crucial because it affects the accuracy, relevance, and cost of data collection.
Primary data is gathered directly from the source. For example, if you conduct a survey among your classmates to find out their favorite sport, the responses you collect are primary data. This data is tailored to your specific research question and is usually more reliable for your study.
Secondary data is information that already exists, collected by others for purposes other than your current study. For example, using government census data to analyze population trends is using secondary data. It saves time and resources but may not perfectly fit your research needs.
| Feature | Primary Data | Secondary Data |
|---|---|---|
| Definition | Collected firsthand for a specific purpose | Collected by others for a different purpose |
| Source | Direct from original respondents or observations | Existing records, reports, databases |
| Cost and Time | Usually more expensive and time-consuming | Less costly and quicker to obtain |
| Relevance | Highly relevant to the research question | May not perfectly fit the research needs |
| Examples | Surveys, experiments, interviews, observations | Census data, research articles, company reports |
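Why choose one over the other? The comparison table above gives the main points to consider: the source of the data, the cost and time needed to obtain it, and how well it fits the research question.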
Once you decide to collect primary data, you need to choose an appropriate method. The main methods are:
Diagram: Data Collection branches into Surveys, Experiments, and Observation.
Surveys involve asking questions to a group of people to gather information. They can be conducted through questionnaires, interviews, or online forms. Surveys are useful for collecting opinions, preferences, or factual data.
Experiments involve manipulating one or more variables to observe the effect on other variables. This method is common in scientific research to establish cause-effect relationships.
Observation means collecting data by watching and recording behavior or events as they occur naturally, without interference. This method is useful when direct questioning is not possible or may bias the results.
Raw data collected directly or obtained from secondary sources can be large and difficult to interpret. To make sense of it, we organize data into meaningful groups or classes. This process is called classification.
For example, if you have the ages of 50 students, you might classify them into age groups like 15-17, 18-20, and so on.
Once classified, data is summarized in tables, a process called tabulation. Tabulation helps in presenting data clearly and facilitates further analysis.
| Age Group (years) | Number of Students |
|---|---|
| 15 - 17 | 12 |
| 18 - 20 | 25 |
| 21 - 23 | 10 |
| 24 - 26 | 3 |
A frequency distribution is a table that shows how often each value or class of data occurs. It is a key tool in statistics to summarize large datasets.
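In addition to the frequency (count), two important concepts are:
- Cumulative frequency (CF): the running total of frequencies up to and including a given class.
- Relative frequency: the frequency of a class divided by the total number of observations, \( \frac{f}{N} \).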
| Age Group (years) | Frequency (f) | Cumulative Frequency (CF) | Relative Frequency (f/N) |
|---|---|---|---|
| 15 - 17 | 12 | 12 | 0.24 |
| 18 - 20 | 25 | 37 | 0.50 |
| 21 - 23 | 10 | 47 | 0.20 |
| 24 - 26 | 3 | 50 | 0.06 |
Note: Total number of students \( N = 50 \).
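The cumulative and relative frequency columns can be derived mechanically from the frequency column. The following is a minimal Python sketch using the age-group frequencies from the table above; the variable names are illustrative and not part of the original text.

```python
from itertools import accumulate

# Class labels and frequencies taken from the age-group table above.
classes = ["15 - 17", "18 - 20", "21 - 23", "24 - 26"]
frequencies = [12, 25, 10, 3]

total = sum(frequencies)                      # N, the total number of students
cumulative = list(accumulate(frequencies))    # running total of frequencies
relative = [f / total for f in frequencies]   # f / N for each class

for label, f, cf, rf in zip(classes, frequencies, cumulative, relative):
    print(f"{label:<8} f={f:>2}  CF={cf:>2}  f/N={rf:.2f}")
```

The last cumulative frequency always equals \( N \), and the relative frequencies always sum to 1, which gives two quick checks on any hand-built table.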
Data can be visually represented to make patterns easier to understand. Common graphical methods include bar charts, pie charts, and histograms.
Histograms are used to represent frequency distributions of continuous data grouped into classes. Unlike bar charts, histograms have adjacent bars touching each other to show the continuous nature of data.
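As a rough illustration, a frequency distribution can be rendered as a text histogram with one bar per class. This sketch uses the age-group frequencies from the frequency distribution table earlier in this section and draws the bars horizontally; the function name `text_histogram` is a hypothetical helper, and a plotting library would normally be used for a real chart.

```python
# Age-group frequencies from the frequency distribution table in this section.
classes = ["15 - 17", "18 - 20", "21 - 23", "24 - 26"]
frequencies = [12, 25, 10, 3]

def text_histogram(labels, counts, symbol="#"):
    """Return one line per class, with a bar whose length equals the frequency."""
    width = max(len(label) for label in labels)
    return [f"{label:<{width}} | {symbol * count} ({count})"
            for label, count in zip(labels, counts)]

# Rows are printed on consecutive lines with no gap, echoing the touching bars.
for line in text_histogram(classes, frequencies):
    print(line)
```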
In this section, we have learned:
Step 1: Understand the definition of primary and secondary data.
Step 2: Analyze each scenario:
Answer: 1 and 4 are primary data; 2 and 3 are secondary data.
45, 52, 47, 58, 62, 55, 48, 50, 53, 60, 65, 70, 68, 72, 75, 80, 78, 82, 85, 88, 90, 92, 95, 98, 100, 85, 75, 65, 55, 45
Construct a frequency distribution table with class intervals of width 10, and calculate cumulative and relative frequencies.
Step 1: Determine the class intervals. Since marks range from 45 to 100, use intervals of width 10 starting at 40: 40 - 49, 50 - 59, 60 - 69, 70 - 79, 80 - 89, 90 - 99, and 100 - 109.
Step 2: Count the frequency \( f \) of marks in each class:
| Class Interval | Frequency (f) |
|---|---|
| 40 - 49 | 4 (45, 45, 47, 48) |
| 50 - 59 | 6 (50, 52, 53, 55, 55, 58) |
| 60 - 69 | 5 (60, 62, 65, 65, 68) |
| 70 - 79 | 5 (70, 72, 75, 75, 78) |
| 80 - 89 | 5 (80, 82, 85, 85, 88) |
| 90 - 99 | 4 (90, 92, 95, 98) |
| 100 - 109 | 1 (100) |
Step 3: Calculate cumulative frequency (CF) by adding each class frequency to the running total: 4, 10, 15, 20, 25, 29, 30.
Step 4: Calculate relative frequency \( \frac{f}{N} \) where \( N=30 \). For example, the first class gives \( \frac{4}{30} \approx 0.133 \).
Answer:
| Class Interval | Frequency (f) | Cumulative Frequency (CF) | Relative Frequency |
|---|---|---|---|
| 40 - 49 | 4 | 4 | 0.133 |
| 50 - 59 | 6 | 10 | 0.200 |
| 60 - 69 | 5 | 15 | 0.167 |
| 70 - 79 | 5 | 20 | 0.167 |
| 80 - 89 | 5 | 25 | 0.167 |
| 90 - 99 | 4 | 29 | 0.133 |
| 100 - 109 | 1 | 30 | 0.033 |
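Counting by hand is error-prone, so it can help to recount the marks programmatically. The sketch below is one possible way to do it in Python, assuming equal-width integer classes that start at 40; `frequency_table`, `START`, and `WIDTH` are illustrative names, not part of the original exercise.

```python
from itertools import accumulate

# The 30 marks listed in the problem statement.
marks = [45, 52, 47, 58, 62, 55, 48, 50, 53, 60,
         65, 70, 68, 72, 75, 80, 78, 82, 85, 88,
         90, 92, 95, 98, 100, 85, 75, 65, 55, 45]

WIDTH = 10   # class width from the problem statement
START = 40   # lower limit of the first class (40 - 49)

def frequency_table(values, start, width):
    """Return rows of (class label, f, CF, f/N) for equal-width integer classes."""
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1   # index of the class containing v
    total = len(values)
    rows = []
    for i, (f, cf) in enumerate(zip(counts, accumulate(counts))):
        lower = start + i * width
        label = f"{lower} - {lower + width - 1}"
        rows.append((label, f, cf, round(f / total, 3)))
    return rows

table = frequency_table(marks, START, WIDTH)
for label, f, cf, rf in table:
    print(f"{label:<10} {f:>2} {cf:>3} {rf:.3f}")
```

A useful sanity check is that the final cumulative frequency equals the number of marks in the list.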
Step 1: Identify the highest bar. The bar for sales 21-40 has the greatest height (100 units), indicating it is the most frequent sales range.
Step 2: Count days with sales above 50,000 INR. These correspond to bars for 61-80 and 81-100 ranges.
Total days with sales above 50,000 INR = 20 + 10 = 30 days. Note: this equals the stated total of 30 days, while the tallest bar alone has a height of 100 units, so the data or the scale of the histogram should be re-checked. Assuming the histogram units correspond to days, the answer is 30 days.
Answer: Most frequent sales range is 21-40 thousand INR. Approximately 30 days had sales above 50,000 INR.
2, 3, 5, 2, 4, 3, 5, 6, 2, 3, 4, 5, 6, 7, 3, 4, 5, 6, 7, 8
Classify the data into groups and tabulate it.
Step 1: Identify the range of data: minimum = 2, maximum = 8.
Step 2: Create classes for each number of books read:
Step 3: Count frequency for each class:
| Number of Books | Frequency |
|---|---|
| 2 | 3 |
| 3 | 4 |
| 4 | 3 |
| 5 | 4 |
| 6 | 3 |
| 7 | 2 |
| 8 | 1 |
Answer: The tabulated data summarizes the number of books read by students.
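When each distinct value forms its own class, the tally can be produced directly from the raw list. This is a minimal Python sketch using the standard library's `collections.Counter` on the data given in the problem; the variable names are illustrative.

```python
from collections import Counter

# Number of books read by each of the 20 students, as listed in the problem.
books_read = [2, 3, 5, 2, 4, 3, 5, 6, 2, 3,
              4, 5, 6, 7, 3, 4, 5, 6, 7, 8]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(books_read)

# Print the classes in ascending order, one table row per class.
for books in sorted(frequency):
    print(f"{books} | {frequency[books]}")
```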
| Car Model | Number Sold |
|---|---|
| Model A | 40 |
| Model B | 25 |
| Model C | 35 |
Step 1: Calculate total cars sold \( N = 40 + 25 + 35 = 100 \).
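Step 2: Calculate relative frequency for each model by dividing its sales by \( N \): Model A: \( \frac{40}{100} = 0.40 \); Model B: \( \frac{25}{100} = 0.25 \); Model C: \( \frac{35}{100} = 0.35 \).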
Answer: Relative frequencies are 0.40, 0.25, and 0.35 respectively.
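The same calculation can be sketched in a few lines of Python. The dictionary below holds the sales figures from the table in this example; the names `sales` and `relative` are illustrative.

```python
# Cars sold per model, from the table in this example.
sales = {"Model A": 40, "Model B": 25, "Model C": 35}

total = sum(sales.values())   # N, the total number of cars sold

# Relative frequency of each model: its count divided by the total.
relative = {model: sold / total for model, sold in sales.items()}

for model, share in relative.items():
    print(f"{model}: {share:.2f}")
```

Because relative frequencies are proportions of the whole, they should sum to 1.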
When to use: When classifying data types in problems.
When to use: While constructing frequency distribution tables from raw data.
When to use: When calculating cumulative frequencies in tables.
When to use: When comparing class frequencies relative to the total.
When to use: When choosing the correct graphical representation.
| Feature | Primary Data | Secondary Data |
|---|---|---|
| Source | Collected firsthand | Collected by others |
| Cost | Higher | Lower |
| Relevance | Specific to study | May be general |
| Examples | Surveys, experiments | Census, reports |