Primary and Secondary Data

Introduction to Data Collection in Statistics

Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions. At the heart of statistics lies the process of data collection. Without accurate and relevant data, any analysis or conclusion would be unreliable. Therefore, understanding how data is gathered and classified is fundamental.

Imagine you want to know the average height of students in your college. Before you can calculate this average, you need to collect height measurements. The way you collect this data and the source of the data determine its type and reliability.

In this chapter, we will explore the two main types of data - primary and secondary - and understand their roles in statistics. We will also learn how data is organized, classified, and represented to extract meaningful information.

Primary and Secondary Data

Let's begin by defining the two fundamental types of data:

  • Primary Data: This is data collected directly by the researcher for a specific purpose. It is original and firsthand information.
  • Secondary Data: This data is collected by someone else for a different purpose but is used by the researcher for their own analysis.

Understanding the difference is crucial because it affects the accuracy, relevance, and cost of data collection.

What is Primary Data?

Primary data is gathered directly from the source. For example, if you conduct a survey among your classmates to find out their favorite sport, the responses you collect are primary data. This data is tailored to your specific research question and is usually more reliable for your study.

What is Secondary Data?

Secondary data is information that already exists, collected by others for purposes other than your current study. For example, using government census data to analyze population trends is using secondary data. It saves time and resources but may not perfectly fit your research needs.

Comparison of Primary and Secondary Data

| Feature | Primary Data | Secondary Data |
|---|---|---|
| Definition | Collected firsthand for a specific purpose | Collected by others for a different purpose |
| Source | Direct from original respondents or observations | Existing records, reports, databases |
| Cost and Time | Usually more expensive and time-consuming | Less costly and quicker to obtain |
| Relevance | Highly relevant to the research question | May not perfectly fit the research needs |
| Examples | Surveys, experiments, interviews, observations | Census data, research articles, company reports |

Advantages and Disadvantages

Why choose one over the other? Here are some points to consider:

  • Primary Data Advantages: Accuracy, control over data quality, tailored information.
  • Primary Data Disadvantages: Expensive, time-consuming, requires planning.
  • Secondary Data Advantages: Cost-effective, readily available, useful for preliminary research.
  • Secondary Data Disadvantages: May be outdated, incomplete, or irrelevant.

Methods of Data Collection

Once you decide to collect primary data, you need to choose an appropriate method. The main methods are:

[Diagram: Data Collection branches into Surveys, Experiments, and Observation]

Surveys

Surveys involve asking questions to a group of people to gather information. They can be conducted through questionnaires, interviews, or online forms. Surveys are useful for collecting opinions, preferences, or factual data.

Experiments

Experiments involve manipulating one or more variables to observe the effect on other variables. This method is common in scientific research to establish cause-effect relationships.

Observation

Observation means collecting data by watching and recording behavior or events as they occur naturally, without interference. This method is useful when direct questioning is not possible or may bias the results.

Classification and Tabulation

Raw data collected directly or obtained from secondary sources can be large and difficult to interpret. To make sense of it, we organize data into meaningful groups or classes. This process is called classification.

For example, if you have the ages of 50 students, you might classify them into age groups like 15-17, 18-20, and so on.

Once classified, data is summarized in tables, a process called tabulation. Tabulation helps in presenting data clearly and facilitates further analysis.

Example of Tabulated Data: Age of Students

| Age Group (years) | Number of Students |
|---|---|
| 15 - 17 | 12 |
| 18 - 20 | 25 |
| 21 - 23 | 10 |
| 24 - 26 | 3 |

Frequency Distribution

A frequency distribution is a table that shows how often each value or class of data occurs. It is a key tool in statistics to summarize large datasets.

In addition to the frequency (count), two important concepts are:

  • Cumulative Frequency: The running total of frequencies up to a certain class.
  • Relative Frequency: The proportion of each class frequency to the total number of observations.

Frequency Distribution Table with Cumulative and Relative Frequencies

| Age Group (years) | Frequency (f) | Cumulative Frequency (CF) | Relative Frequency (f/N) |
|---|---|---|---|
| 15 - 17 | 12 | 12 | 0.24 |
| 18 - 20 | 25 | 37 | 0.50 |
| 21 - 23 | 10 | 47 | 0.20 |
| 24 - 26 | 3 | 50 | 0.06 |

Note: Total number of students \( N = 50 \).
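
The cumulative and relative frequency columns can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and the frequencies are those of the age table above.

```python
from itertools import accumulate

# Class labels and frequencies taken from the age table above.
age_groups = ["15-17", "18-20", "21-23", "24-26"]
frequencies = [12, 25, 10, 3]

total = sum(frequencies)  # N, the total number of observations

# Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Relative frequency: each class frequency divided by N.
relative = [f / total for f in frequencies]

for group, f, cf, rf in zip(age_groups, frequencies, cumulative, relative):
    print(f"{group:>7} {f:>4} {cf:>4} {rf:>6.2f}")
```

The last cumulative frequency always equals N, and the relative frequencies always sum to 1; both make handy checks on a hand-built table.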

Graphical Representation and Histograms

Data can be visually represented to make patterns easier to understand. Common graphical methods include bar charts, pie charts, and histograms.

Histograms are used to represent frequency distributions of continuous data grouped into classes. Unlike bar charts, histograms have adjacent bars touching each other to show the continuous nature of data.

[Figure: Histogram of Student Ages, with the age groups 15-17, 18-20, 21-23 and 24-26 on the horizontal axis]
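
Where no plotting library is at hand, the shape of the age distribution can still be sketched in plain text. The snippet below is only an illustration, assuming the frequencies from the age table; the helper name `text_histogram` is hypothetical. A text chart cannot draw touching bars, so it conveys the relative heights only.

```python
def text_histogram(labels, frequencies, symbol="#"):
    """Return one line per class, with a bar whose length equals the frequency."""
    width = max(len(label) for label in labels)
    lines = []
    for label, f in zip(labels, frequencies):
        lines.append(f"{label:<{width}} | {symbol * f} ({f})")
    return lines


age_groups = ["15-17", "18-20", "21-23", "24-26"]
frequencies = [12, 25, 10, 3]

for line in text_histogram(age_groups, frequencies):
    print(line)
```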

Summary

In this section, we have learned:

  • The importance of data collection and the distinction between primary and secondary data.
  • Various methods of collecting primary data: surveys, experiments, and observation.
  • How to classify and tabulate raw data for better understanding.
  • Constructing frequency distribution tables including cumulative and relative frequencies.
  • Representing data graphically using histograms to visualize frequency distributions.

Formula Bank

Relative Frequency
\[ \text{Relative Frequency} = \frac{f}{N} \]
where: \( f \) = frequency of a class, \( N \) = total number of observations
Cumulative Frequency
\[ CF_i = \sum_{j=1}^i f_j \]
where: \( CF_i \) = cumulative frequency up to class \( i \), \( f_j \) = frequency of class \( j \)
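
As a worked instance of both formulas, take the third class (21 - 23 years) of the age table, where the class frequencies are 12, 25, 10 and 3 and \( N = 50 \):

\[ CF_3 = f_1 + f_2 + f_3 = 12 + 25 + 10 = 47, \qquad \frac{f_3}{N} = \frac{10}{50} = 0.20 \]

Both values agree with the third row of the frequency distribution table.
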

Example 1: Identifying Primary and Secondary Data (Easy)
Classify the following data as primary or secondary:
  1. Data collected by a researcher through interviews with farmers about crop yields.
  2. Population statistics obtained from the latest government census report.
  3. Sales figures from a company's annual report used for market analysis.
  4. Temperature readings recorded by a weather station for a research project.

Step 1: Understand the definition of primary and secondary data.

Step 2: Analyze each scenario:

  • 1. Data collected firsthand by interviews -> Primary Data.
  • 2. Data from government census (existing data) -> Secondary Data.
  • 3. Sales figures from reports prepared by others -> Secondary Data.
  • 4. Temperature readings recorded directly for a project -> Primary Data.

Answer: 1 and 4 are primary data; 2 and 3 are secondary data.

Example 2: Constructing a Frequency Distribution Table (Medium)
Given the following data representing the marks obtained by 30 students in a test:

45, 52, 47, 58, 62, 55, 48, 50, 53, 60, 65, 70, 68, 72, 75, 80, 78, 82, 85, 88, 90, 92, 95, 98, 100, 85, 75, 65, 55, 45

Construct a frequency distribution table with class intervals of width 10, and calculate cumulative and relative frequencies.

Step 1: Determine the class intervals. Since marks range from 45 to 100, use intervals:

  • 40-49
  • 50-59
  • 60-69
  • 70-79
  • 80-89
  • 90-99
  • 100-109

Step 2: Count the frequency \( f \) of marks in each class:

| Class Interval | Frequency (f) |
|---|---|
| 40 - 49 | 4 (45, 45, 47, 48) |
| 50 - 59 | 6 (50, 52, 53, 55, 55, 58) |
| 60 - 69 | 5 (60, 62, 65, 65, 68) |
| 70 - 79 | 5 (70, 72, 75, 75, 78) |
| 80 - 89 | 5 (80, 82, 85, 85, 88) |
| 90 - 99 | 4 (90, 92, 95, 98) |
| 100 - 109 | 1 (100) |

Check: 4 + 6 + 5 + 5 + 5 + 4 + 1 = 30, which matches the number of students.

Step 3: Calculate cumulative frequency (CF):

  • 40-49: 4
  • 50-59: 4 + 6 = 10
  • 60-69: 10 + 5 = 15
  • 70-79: 15 + 5 = 20
  • 80-89: 20 + 5 = 25
  • 90-99: 25 + 4 = 29
  • 100-109: 29 + 1 = 30

Step 4: Calculate relative frequency \( \frac{f}{N} \) where \( N=30 \):

  • 40-49: 4/30 ≈ 0.133
  • 50-59: 6/30 = 0.200
  • 60-69: 5/30 ≈ 0.167
  • 70-79: 5/30 ≈ 0.167
  • 80-89: 5/30 ≈ 0.167
  • 90-99: 4/30 ≈ 0.133
  • 100-109: 1/30 ≈ 0.033

Answer:

| Class Interval | Frequency (f) | Cumulative Frequency (CF) | Relative Frequency |
|---|---|---|---|
| 40 - 49 | 4 | 4 | 0.133 |
| 50 - 59 | 6 | 10 | 0.200 |
| 60 - 69 | 5 | 15 | 0.167 |
| 70 - 79 | 5 | 20 | 0.167 |
| 80 - 89 | 5 | 25 | 0.167 |
| 90 - 99 | 4 | 29 | 0.133 |
| 100 - 109 | 1 | 30 | 0.033 |
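
Tallying by hand is error-prone, so it can help to recount the raw marks programmatically. The following is a minimal sketch, assuming classes of width 10 starting at 40 as in Step 1; the constant names are illustrative.

```python
marks = [45, 52, 47, 58, 62, 55, 48, 50, 53, 60,
         65, 70, 68, 72, 75, 80, 78, 82, 85, 88,
         90, 92, 95, 98, 100, 85, 75, 65, 55, 45]

CLASS_WIDTH = 10
FIRST_LOWER_LIMIT = 40
NUM_CLASSES = 7  # 40-49 through 100-109

# Tally: integer division maps each mark to the index of its class.
frequencies = [0] * NUM_CLASSES
for mark in marks:
    index = (mark - FIRST_LOWER_LIMIT) // CLASS_WIDTH
    frequencies[index] += 1

total = len(marks)  # N
running = 0
for i, f in enumerate(frequencies):
    lower = FIRST_LOWER_LIMIT + i * CLASS_WIDTH
    upper = lower + CLASS_WIDTH - 1
    running += f  # cumulative frequency up to this class
    print(f"{lower}-{upper}: f={f}, CF={running}, f/N={f / total:.3f}")
```

Running it prints one row per class. The final cumulative frequency must equal the number of marks in the list; if it does not, a value has been missed or counted twice.
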
Example 3: Interpreting a Histogram (Medium)
The histogram below shows the distribution of daily sales (in INR thousands) at a shop over 30 days. Analyze the histogram to answer:
  • Which sales range is most frequent?
  • What is the approximate total number of days with sales above 50,000 INR?

[Figure: Histogram of Daily Sales (INR thousands), with the sales ranges 0-20, 21-40, 41-60, 61-80 and 81-100 on the horizontal axis]

Step 1: Identify the highest bar. The bar for sales 21-40 has the greatest height (100 units), indicating it is the most frequent sales range.

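Step 2: Identify the bars for sales above 50,000 INR. The 61-80 and 81-100 ranges lie entirely above this level. The 41-60 range straddles it, so any count based on whole classes is only approximate.

  • 61-80 range: bar height 20
  • 81-100 range: bar height 10

Step 3: Check the result against the total. The two heights sum to 20 + 10 = 30, which cannot be a number of days: the shop was observed for only 30 days, and the most frequent range (21-40) must account for more days than either of these classes. The bar heights are therefore not day counts, and they must be converted using the frequency scale of the histogram before a number of days can be stated.

Answer: The most frequent sales range is 21-40 thousand INR. The number of days with sales above 50,000 INR is the combined frequency of the 61-80 and 81-100 classes, which is fewer than 30 days and cannot be determined from the bar heights alone.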

Example 4: Tabulation of Classified Data (Easy)
The following data shows the number of books read by 20 students in a month:

2, 3, 5, 2, 4, 3, 5, 6, 2, 3, 4, 5, 6, 7, 3, 4, 5, 6, 7, 8

Classify the data into groups and tabulate it.

Step 1: Identify the range of data: minimum = 2, maximum = 8.

Step 2: Create classes for each number of books read:

  • 2 books
  • 3 books
  • 4 books
  • 5 books
  • 6 books
  • 7 books
  • 8 books

Step 3: Count frequency for each class:

| Number of Books | Frequency |
|---|---|
| 2 | 3 |
| 3 | 4 |
| 4 | 3 |
| 5 | 4 |
| 6 | 3 |
| 7 | 2 |
| 8 | 1 |

Answer: The tabulated data summarizes the number of books read by students.
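
For ungrouped data like this, where each distinct value is its own class, the tally can be sketched with `collections.Counter` from the Python standard library. The variable names are illustrative.

```python
from collections import Counter

books_read = [2, 3, 5, 2, 4, 3, 5, 6, 2, 3,
              4, 5, 6, 7, 3, 4, 5, 6, 7, 8]

# Counter tallies how many times each distinct value occurs.
tally = Counter(books_read)

# Print the classes in ascending order, as in the table.
for books in sorted(tally):
    print(f"{books}: {tally[books]}")
```

The frequencies should add up to 20, the number of students.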

Example 5: Calculating Relative Frequency (Easy)
Given the frequency distribution of cars sold in a month by a dealership:

| Car Model | Number Sold |
|---|---|
| Model A | 40 |
| Model B | 25 |
| Model C | 35 |

Calculate the relative frequency for each model.

Step 1: Calculate total cars sold \( N = 40 + 25 + 35 = 100 \).

Step 2: Calculate relative frequency for each model:

  • Model A: \( \frac{40}{100} = 0.40 \)
  • Model B: \( \frac{25}{100} = 0.25 \)
  • Model C: \( \frac{35}{100} = 0.35 \)

Answer: Relative frequencies are 0.40, 0.25, and 0.35 respectively.
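
The same calculation for categorical data can be sketched with a dictionary, printing each relative frequency as both a proportion and a percentage. This is an illustrative snippet, not part of the original solution.

```python
cars_sold = {"Model A": 40, "Model B": 25, "Model C": 35}

total = sum(cars_sold.values())  # N = total cars sold

# Relative frequency per model, as a proportion of the total.
relative = {model: sold / total for model, sold in cars_sold.items()}

for model, share in relative.items():
    # The :.0% format multiplies by 100 and appends a percent sign.
    print(f"{model}: {share:.2f} ({share:.0%})")
```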

Tips & Tricks

Tip: Always check if data is collected firsthand or from existing sources to quickly identify primary vs secondary data.

When to use: When classifying data types in problems.

Tip: Use tally marks to count frequencies efficiently before tabulating data.

When to use: While constructing frequency distribution tables from raw data.

Tip: Cumulative frequency can be quickly found by adding the current class frequency to the previous cumulative frequency.

When to use: When calculating cumulative frequencies in tables.

Tip: Relative frequency is a fraction or percentage; converting to percentage helps in easier interpretation.

When to use: When comparing class frequencies relative to the total.

Tip: Histograms are best used for continuous data grouped into classes; bar charts are for categorical data.

When to use: When choosing the correct graphical representation.

Common Mistakes to Avoid

❌ Confusing primary data with secondary data by assuming all collected data is primary.
✓ Verify the source and purpose of data collection to classify correctly.
Why: Students often overlook the origin and intent behind data collection.
❌ Incorrectly summing frequencies when calculating cumulative frequency.
✓ Add frequencies sequentially from the first class upwards without skipping.
Why: Rushing through calculations leads to missing or double-counting frequencies.
❌ Using bar charts for continuous grouped data instead of histograms.
✓ Use histograms for continuous data with class intervals to accurately represent frequency.
Why: Misunderstanding data types causes inappropriate graph selection.
❌ Not converting relative frequencies into percentages for better understanding.
✓ Multiply relative frequency by 100 to express as percentage.
Why: Students may find decimals less intuitive than percentages.
❌ Mixing up class intervals during classification and tabulation.
✓ Ensure class intervals are mutually exclusive and exhaustive.
Why: Overlapping or missing intervals cause inaccurate data representation.
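
The last check can be automated. The sketch below assumes integer data and class intervals written as inclusive (lower, upper) pairs; the function name `check_classes` is hypothetical. It uses the marks from Example 2.

```python
def check_classes(intervals, data):
    """Check integer class intervals given as inclusive (lower, upper) pairs.

    Returns (exclusive, exhaustive):
      exclusive  - no two intervals share a value
      exhaustive - every data value falls inside some interval
    """
    ordered = sorted(intervals)
    # After sorting, any overlap shows up between two neighbouring intervals.
    exclusive = all(prev[1] < nxt[0] for prev, nxt in zip(ordered, ordered[1:]))
    exhaustive = all(any(lo <= x <= hi for lo, hi in ordered) for x in data)
    return exclusive, exhaustive


marks = [45, 52, 47, 58, 62, 55, 48, 50, 53, 60,
         65, 70, 68, 72, 75, 80, 78, 82, 85, 88,
         90, 92, 95, 98, 100, 85, 75, 65, 55, 45]

good = [(40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99), (100, 109)]
overlapping = [(40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
missing_top = good[:-1]  # drops the 100-109 class

print(check_classes(good, marks))         # (True, True)
print(check_classes(overlapping, marks))  # (False, True)
print(check_classes(missing_top, marks))  # (True, False)
```

With the overlapping intervals a mark of 50 could be placed in either 40-50 or 50-60, and with the top class missing the mark of 100 has nowhere to go.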

| Feature | Primary Data | Secondary Data |
|---|---|---|
| Source | Collected firsthand | Collected by others |
| Cost | Higher | Lower |
| Relevance | Specific to study | May be general |
| Examples | Surveys, experiments | Census, reports |

Constructing Frequency Distribution and Graphical Representation

  • Classify raw data into meaningful intervals or categories
  • Count frequencies for each class
  • Calculate cumulative and relative frequencies
  • Use histograms for continuous data and bar charts for categorical data
  • Visualize data to identify patterns quickly
Key Takeaway:

Organizing and representing data effectively is key to accurate statistical analysis.
