Probability-Based Spam Email Detection System
Created by Tina Hill

Grade 12 | Math | 2 days
In this project, students design and implement a probability-based spam email detection system. They apply probability theory, statistical analysis, and data analysis to classify emails as spam or not spam, extract key characteristics from email data, and quantify uncertainty in data to make predictions. Students will define sample spaces, apply probability rules (including conditional probability), and refine their system to improve accuracy, culminating in a presentation and report detailing their process and results.
Probability Theory, Spam Detection, Conditional Probability, Email Classification, Statistical Analysis, Data Analysis

Inquiry Framework

Question Framework

Driving Question

The overarching question that guides the entire project.
How can we leverage the principles of probability and statistical analysis to design and implement an effective spam email detection system that accurately distinguishes between legitimate and unsolicited emails, while also addressing the inherent uncertainties in data classification?

Essential Questions

Supporting questions that break down major concepts.
  • How can probability be used to classify spam emails?
  • What are the key characteristics that differentiate spam emails from legitimate emails?
  • How can you quantify the uncertainty and make predictions based on data analysis?

Standards & Learning Goals

Learning Goals

By the end of this project, students will be able to:
  • Apply probability theory to classify emails as spam or not spam.
  • Identify and extract key characteristics from email data that differentiate spam from legitimate emails.
  • Quantify uncertainty in data and make predictions based on statistical analysis.
  • Design and implement a spam email detection system.
  • Evaluate the accuracy and effectiveness of the spam detection system.

Common Core Standards

S.CP.A.1
Primary
Describe events as subsets of a sample space (the set of outcomes) using characteristics (or categories) of the outcomes, or as unions, intersections, or complements of other events (“or,” “and,” “not”).
Reason: This standard is directly applicable as students will be defining the event of an email being spam as a subset of all emails, using characteristics to define this subset.
S.CP.B.4
Primary
Demonstrate an understanding of the addition rule, the multiplication rule, conditional probability, and independence.
Reason: This standard is essential as students will use these probability rules to calculate the likelihood of an email being spam based on identified characteristics.
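For reference, the rules named in S.CP.B.4 can be written as follows for two events A and B in the same sample space, assuming P(A) > 0 wherever A is conditioned on. These are the usual textbook statements, not wording quoted from the standard.

```latex
\begin{align*}
P(A \cup B) &= P(A) + P(B) - P(A \cap B) && \text{(addition rule)} \\
P(A \cap B) &= P(A)\,P(B \mid A) && \text{(multiplication rule)} \\
P(B \mid A) &= \frac{P(A \cap B)}{P(A)} && \text{(conditional probability)} \\
P(A \cap B) &= P(A)\,P(B) && \text{(holds exactly when $A$ and $B$ are independent)}
\end{align*}
```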

Entry Events

Events that will be used to introduce the project to students

The Inbox Overload Simulation

Simulate a 'spam outbreak' scenario where students receive a flood of sample emails (some spam, some legitimate) within a short time frame. Task them with quickly filtering the emails based on initial observations. This creates a sense of urgency and highlights the need for an efficient and automated spam detection system, motivating them to learn about probability-based solutions.

Portfolio Activities

These activities progressively build towards your learning goals, with each submission contributing to the student's final portfolio.
Activity 1

Defining the Email Universe

Students will define the sample space of all possible emails and then identify characteristics that define subsets of spam and non-spam emails. They will use set theory to understand the relationships between these sets.

Steps

Here is some basic scaffolding to help students complete the activity.
1. Define the sample space of all possible emails.
2. Identify characteristics of spam emails (e.g., sender, subject keywords, links).
3. Identify characteristics of non-spam emails.
4. Represent spam and non-spam emails as subsets of the sample space.
5. Create Venn diagrams to illustrate the relationships (unions, intersections, complements) between these subsets.
6. Write a report summarizing the findings, including definitions, identified characteristics, and Venn diagrams.
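The set relationships in the steps above can be prototyped with Python's built-in set type before any Venn diagrams are drawn. This is only an illustrative sketch: the six email IDs and the two characteristics (`has_link`, `unknown_sender`) are made-up placeholders, not data from the project.

```python
# Hypothetical sample space: each email is represented by an ID.
sample_space = {"e1", "e2", "e3", "e4", "e5", "e6"}

# Hypothetical events, each a subset of the sample space.
spam = {"e1", "e2", "e3"}
has_link = {"e2", "e3", "e4"}
unknown_sender = {"e1", "e3", "e5"}

not_spam = sample_space - spam               # complement ("not")
spam_and_link = spam & has_link              # intersection ("and")
link_or_unknown = has_link | unknown_sender  # union ("or")


def probability(event, space=sample_space):
    """P(event), assuming every email in the space is equally likely."""
    return len(event & space) / len(space)


print(sorted(not_spam))
print(sorted(spam_and_link))
print(probability(link_or_unknown))
```

Each region of a Venn diagram corresponds to one of these set expressions, so students can use the printed sets to check their diagrams.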

Final Product

What students will submit as the final product of the activity
A written report defining the sample space, the characteristics of spam and non-spam subsets, and Venn diagrams illustrating the relationships between these sets.

Alignment

How this activity aligns with the learning objectives & standards
Covers S.CP.A.1 by having students define the sample space and subsets of spam and non-spam emails.
Activity 2

Probability Rule Application

Students will learn and apply the addition and multiplication rules of probability to calculate the likelihood of an email being spam based on identified characteristics such as sender, subject keywords, and the presence of suspicious links.

Steps

Here is some basic scaffolding to help students complete the activity.
1. Research and understand the addition and multiplication rules of probability.
2. Select a set of sample emails and identify key characteristics.
3. Use the multiplication rule to calculate joint probabilities, such as the probability that an email is spam and has a given characteristic.
4. Use the addition rule to calculate the probability that an email has at least one of several characteristics, subtracting the overlap so that no email is counted twice.
5. Prepare a presentation to demonstrate the calculations and explain the application of the rules.
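The two rules in the steps above can be checked against counts from a small hand-labeled sample. The sketch below uses invented counts for 100 emails and a hypothetical characteristic (the word "free" in the subject line). It uses exact fractions so that the rules can be verified without rounding error.

```python
from fractions import Fraction

# Hypothetical counts from 100 hand-labeled emails (assumed, not real data).
total = 100
n_spam = 40            # emails labeled spam
n_free = 30            # emails whose subject contains "free"
n_spam_and_free = 24   # emails that are spam and contain "free"

p_spam = Fraction(n_spam, total)
p_free = Fraction(n_free, total)
p_spam_and_free = Fraction(n_spam_and_free, total)

# Multiplication rule: P(A and B) = P(A) * P(B | A)
p_free_given_spam = Fraction(n_spam_and_free, n_spam)
assert p_spam * p_free_given_spam == p_spam_and_free

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_spam_or_free = p_spam + p_free - p_spam_and_free

# Independence: A and B are independent only if P(A and B) = P(A) * P(B)
independent = p_spam_and_free == p_spam * p_free

print(p_spam_or_free)
print(independent)
```

With these made-up counts, "spam" and "contains free" are not independent, which is exactly what makes the keyword useful as evidence.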

Final Product

What students will submit as the final product of the activity
A presentation that demonstrates the application of the addition and multiplication rules in determining the probability of an email being spam, along with the calculations and justifications.

Alignment

How this activity aligns with the learning objectives & standards
Covers S.CP.B.4 by applying the addition and multiplication rules to calculate probabilities of emails being spam or not spam based on identified characteristics.
Activity 3

Conditional Probability Refinement

Students will use conditional probability to improve the accuracy of their spam detection system. They will calculate the probability of an email being spam given certain characteristics and refine their system based on these probabilities.

Steps

Here is some basic scaffolding to help students complete the activity.
1. Learn about conditional probability and its applications.
2. Collect data on various email characteristics and whether the emails are spam or not.
3. Calculate the conditional probability of an email being spam given specific characteristics.
4. Incorporate these conditional probabilities into the spam detection system.
5. Test the revised system and evaluate its performance.
6. Write a report detailing the process and results.
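Steps 2 to 4 above can be sketched as a relative-frequency estimate of P(spam | characteristic) followed by a simple threshold rule. The ten labeled records, the single characteristic (a suspicious link), and the 0.5 threshold are all assumptions chosen for illustration. A real system would combine several characteristics and far more data.

```python
# Hypothetical labeled data: (has_suspicious_link, is_spam) for each email.
emails = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
    (False, False), (False, False),
]


def conditional_probability(records, given):
    """Estimate P(spam | has_link == given) as a relative frequency."""
    matching = [is_spam for has_link, is_spam in records if has_link == given]
    if not matching:
        raise ValueError("no emails with this characteristic value")
    return sum(matching) / len(matching)


def classify(has_link, records, threshold=0.5):
    """Flag an email as spam when P(spam | characteristic) exceeds the threshold."""
    return conditional_probability(records, has_link) > threshold


p_spam_given_link = conditional_probability(emails, True)      # 3 of 4 emails
p_spam_given_no_link = conditional_probability(emails, False)  # 1 of 6 emails
print(p_spam_given_link, p_spam_given_no_link)
print(classify(True, emails), classify(False, emails))
```

Moving the threshold trades missed spam against legitimate email that is wrongly flagged, which gives students a concrete parameter to refine.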

Final Product

What students will submit as the final product of the activity
A revised spam detection system algorithm that incorporates conditional probabilities, along with a report explaining the changes and their impact on accuracy.
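One way to measure the impact on accuracy for the report is to compare the system's predictions with the true labels on a set of test emails. The sketch below is a minimal example under stated assumptions: the `evaluate` helper and the eight prediction/label pairs are hypothetical, and it assumes at least one test email.

```python
def evaluate(predictions, labels):
    """Return accuracy, precision, and recall, with spam (True) as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p and y for p, y in pairs)          # spam correctly flagged
    fp = sum(p and not y for p, y in pairs)      # legitimate email flagged as spam
    fn = sum(not p and y for p, y in pairs)      # spam that slipped through
    tn = sum(not p and not y for p, y in pairs)  # legitimate email let through
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical predictions and true labels for eight test emails.
predicted = [True, True, False, False, True, False, False, True]
actual = [True, False, False, False, True, True, False, True]
print(evaluate(predicted, actual))
```

Running the same function on the system before and after the conditional-probability refinement gives a like-for-like comparison. Reporting precision and recall alongside accuracy shows which kind of error changed.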

Alignment

How this activity aligns with the learning objectives & standards
Covers S.CP.B.4 by calculating conditional probabilities to refine spam detection accuracy based on multiple factors. Also aligns with learning goal 'Quantify uncertainty in data and make predictions based on statistical analysis.'

Rubric & Reflection

Portfolio Rubric

Grading criteria for assessing the overall project portfolio

Spam Email Detection System Rubric

Category 1

Email Universe Definition and Representation

Assesses the student's ability to define the email universe, identify key characteristics, and represent relationships using set theory.
Criterion 1

Sample Space Definition

The clarity and accuracy of the sample space definition, including all possible email types.

Exemplary
4 Points

The sample space is comprehensively and accurately defined, including all possible email types with detailed explanations. Examples are well-chosen and highly relevant.

Proficient
3 Points

The sample space is accurately defined, covering most email types. Examples are relevant and clearly illustrate the definition.

Developing
2 Points

The sample space definition is attempted but lacks clarity or completeness. Some email types are omitted, and examples may be unclear.

Beginning
1 Point

The sample space definition is missing or inaccurate. There is a lack of understanding of what constitutes the sample space of emails.

Criterion 2

Characteristic Identification and Justification

Identification and justification of characteristics for spam and non-spam emails. Includes why these characteristics are indicative of each category.

Exemplary
4 Points

Exhaustively identifies and justifies characteristics of both spam and non-spam emails with insightful explanations. Justifications are supported by relevant data or examples.

Proficient
3 Points

Clearly identifies and justifies key characteristics of spam and non-spam emails with logical explanations. Justifications are generally sound and appropriate.

Developing
2 Points

Identifies some characteristics of spam and non-spam emails but lacks thorough justification. Explanations may be superficial or missing.

Beginning
1 Point

Struggles to identify relevant characteristics of spam and non-spam emails. Justifications are absent or illogical.

Criterion 3

Venn Diagram Representation

Accuracy and effectiveness of Venn diagrams in illustrating relationships between spam and non-spam subsets (unions, intersections, complements).

Exemplary
4 Points

Venn diagrams are exceptionally accurate and insightful, clearly illustrating complex relationships between spam and non-spam subsets. Diagrams enhance understanding significantly.

Proficient
3 Points

Venn diagrams are accurate and effectively illustrate the relationships between spam and non-spam subsets. Diagrams are well-labeled and easy to understand.

Developing
2 Points

Venn diagrams contain inaccuracies or are unclear, making it difficult to understand the relationships between spam and non-spam subsets.

Beginning
1 Point

Venn diagrams are missing, or completely inaccurate, failing to illustrate the relationships between spam and non-spam subsets.

Criterion 4

Report Quality and Clarity

The overall quality and clarity of the written report, including organization, grammar, and coherence.

Exemplary
4 Points

The report is exceptionally well-written, organized, and coherent. It demonstrates a deep understanding of the concepts and presents them in a clear and compelling manner. Grammar and mechanics are flawless.

Proficient
3 Points

The report is well-written, organized, and coherent. It demonstrates a good understanding of the concepts and presents them clearly. Grammar and mechanics are generally correct.

Developing
2 Points

The report is somewhat disorganized and lacks clarity in places. It demonstrates a basic understanding of the concepts but may contain grammatical errors or inconsistencies.

Beginning
1 Point

The report is poorly written, disorganized, and difficult to understand. It demonstrates a limited understanding of the concepts and contains numerous grammatical errors.

Category 2

Probability Rule Application and Presentation

Evaluates the student's ability to apply probability rules to classify emails and present their findings effectively.
Criterion 1

Understanding of Probability Rules

Accuracy and completeness of research and understanding of the addition and multiplication rules of probability.

Exemplary
4 Points

Demonstrates a comprehensive and nuanced understanding of the addition and multiplication rules with clear explanations and insightful examples beyond the basic definitions.

Proficient
3 Points

Demonstrates a solid and accurate understanding of the addition and multiplication rules, explaining them clearly with relevant examples.

Developing
2 Points

Demonstrates a partial understanding of the addition and multiplication rules, with some inaccuracies or gaps in explanation.

Beginning
1 Point

Demonstrates a limited or inaccurate understanding of the addition and multiplication rules.

Criterion 2

Sample Email and Characteristic Selection

Appropriateness and relevance of the chosen sample emails and identified characteristics for applying probability rules.

Exemplary
4 Points

Selects highly appropriate and diverse sample emails, identifying insightful and relevant characteristics that maximize the effectiveness of probability rule application.

Proficient
3 Points

Selects appropriate sample emails and identifies relevant characteristics for applying probability rules effectively.

Developing
2 Points

Selects sample emails and identifies characteristics, but some may be inappropriate or irrelevant for applying probability rules.

Beginning
1 Point

Selects inappropriate sample emails and struggles to identify relevant characteristics for applying probability rules.

Criterion 3

Accuracy of Probability Calculations

Accuracy of probability calculations using the multiplication and addition rules.

Exemplary
4 Points

Performs calculations with impeccable accuracy, demonstrating a mastery of the rules and their application to the email samples.

Proficient
3 Points

Performs calculations accurately, demonstrating a good understanding of the rules and their application.

Developing
2 Points

Makes some errors in calculations, indicating a need for further practice and understanding of the rules.

Beginning
1 Point

Makes significant errors in calculations, demonstrating a lack of understanding of the rules and their application.

Criterion 4

Presentation Clarity and Effectiveness

Clarity and effectiveness of the presentation in demonstrating the calculations and explaining the application of probability rules.

Exemplary
4 Points

The presentation is exceptionally clear, engaging, and insightful, effectively demonstrating the calculations and explaining the application of probability rules with a high level of understanding.

Proficient
3 Points

The presentation is clear and effective, demonstrating the calculations and explaining the application of probability rules in a well-organized manner.

Developing
2 Points

The presentation is somewhat unclear or disorganized, making it difficult to fully understand the calculations and application of probability rules.

Beginning
1 Point

The presentation is unclear, disorganized, and fails to effectively demonstrate the calculations or explain the application of probability rules.

Category 3

Conditional Probability and System Refinement

Assesses the student's ability to refine a spam detection system using conditional probability.
Criterion 1

Understanding of Conditional Probability

Understanding and explanation of conditional probability and its relevance to spam detection.

Exemplary
4 Points

Demonstrates a sophisticated understanding of conditional probability, explaining its nuances and implications for spam detection with insightful examples and connections to real-world applications.

Proficient
3 Points

Demonstrates a clear and accurate understanding of conditional probability, explaining its relevance to spam detection with appropriate examples.

Developing
2 Points

Demonstrates a partial understanding of conditional probability, with some inaccuracies or gaps in explanation and relevance to spam detection.

Beginning
1 Point

Demonstrates a limited or inaccurate understanding of conditional probability and its relevance to spam detection.

Criterion 2

Data Collection Quality and Relevance

Quality and relevance of data collected on email characteristics and spam/non-spam classification.

Exemplary
4 Points

Collects highly relevant and comprehensive data on email characteristics and spam/non-spam classification, ensuring a robust dataset for accurate conditional probability calculations.

Proficient
3 Points

Collects relevant data on email characteristics and spam/non-spam classification, providing a sufficient dataset for conditional probability calculations.

Developing
2 Points

Collects some data on email characteristics and spam/non-spam classification, but the data may be incomplete or lack relevance.

Beginning
1 Point

Collects limited or irrelevant data on email characteristics and spam/non-spam classification, hindering conditional probability calculations.

Criterion 3

Accuracy and Incorporation of Calculations

Accuracy of conditional probability calculations and their incorporation into the spam detection system.

Exemplary
4 Points

Performs conditional probability calculations with impeccable accuracy and seamlessly integrates them into the spam detection system, significantly improving its performance.

Proficient
3 Points

Performs conditional probability calculations accurately and incorporates them effectively into the spam detection system, improving its performance.

Developing
2 Points

Makes some errors in conditional probability calculations or struggles to effectively incorporate them into the spam detection system.

Beginning
1 Point

Makes significant errors in conditional probability calculations and fails to effectively incorporate them into the spam detection system.

Criterion 4

Testing, Evaluation, and Reporting

Thoroughness of testing, evaluation, and reporting on the revised spam detection system's performance.

Exemplary
4 Points

Conducts rigorous testing and evaluation of the revised system, providing a comprehensive and insightful report that clearly articulates the impact of conditional probabilities on accuracy and overall performance. Includes error analysis and suggestions for future improvements.

Proficient
3 Points

Tests and evaluates the revised system, providing a clear report that explains the impact of conditional probabilities on accuracy and performance.

Developing
2 Points

Provides limited testing and evaluation of the revised system, with a report that lacks detail or clarity regarding the impact of conditional probabilities on accuracy and performance.

Beginning
1 Point

Fails to adequately test and evaluate the revised system, with a report that is incomplete or missing key information.

Reflection Prompts

End-of-project reflection questions to get students to think about their learning
Question 1

Reflecting on the entire project, what was the most surprising thing you learned about the challenges of building a spam detection system?

Text
Required
Question 2

To what extent do you feel confident in your ability to apply probability theory to real-world classification problems, such as spam detection?

Scale
Required
Question 3

If you were to continue working on this spam detection system, what is one specific improvement you would focus on, and why?

Text
Required
Question 4

How well did your team work together to overcome challenges in this project?

Multiple choice
Required
Options
Very well, we were highly collaborative and efficient
Well, we mostly worked together effectively
Adequately, but there were some communication challenges
Poorly, we struggled to collaborate
Very poorly, teamwork was minimal