
Probability-Based Spam Email Detection System
Inquiry Framework
Question Framework
Driving Question
The overarching question that guides the entire project.
How can we leverage the principles of probability and statistical analysis to design and implement an effective spam email detection system that accurately distinguishes between legitimate and unsolicited emails, while also addressing the inherent uncertainties in data classification?
Essential Questions
Supporting questions that break down major concepts.
- How can probability be used to classify spam emails?
- What are the key characteristics that differentiate spam emails from legitimate emails?
- How can you quantify the uncertainty and make predictions based on data analysis?
Standards & Learning Goals
Learning Goals
By the end of this project, students will be able to:
- Apply probability theory to classify emails as spam or not spam.
- Identify and extract key characteristics from email data that differentiate spam from legitimate emails.
- Quantify uncertainty in data and make predictions based on statistical analysis.
- Design and implement a spam email detection system.
- Evaluate the accuracy and effectiveness of the spam detection system.
Common Core Standards
Entry Events
Events that will be used to introduce the project to students.
The Inbox Overload Simulation
Simulate a 'spam outbreak' scenario in which students receive a flood of sample emails (some spam, some legitimate) within a short time frame. Task them with quickly filtering the emails based on initial observations. This creates a sense of urgency and highlights the need for an efficient and automated spam detection system, motivating them to learn about probability-based solutions.
Portfolio Activities
These activities progressively build towards your learning goals, with each submission contributing to the student's final portfolio.
Defining the Email Universe
Students will define the sample space of all possible emails and then identify characteristics that define subsets of spam and non-spam emails. They will use set theory to understand the relationships between these sets.
Steps
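One possible starting scaffold, assuming students have access to Python: the sketch below models a tiny email universe as a set and derives the spam and non-spam subsets with set operations. The email IDs, the `has_link` and `mentions_prize` characteristics, and the labels are hypothetical classroom data, not part of the project materials.

```python
# Hypothetical sample space: eight emails, identified by ID.
universe = set(range(1, 9))

# Hypothetical labels and characteristics (made up for illustration).
spam = {1, 2, 3, 4}
has_link = {2, 3, 4, 5}
mentions_prize = {1, 2, 6}

not_spam = universe - spam                  # complement of the spam subset
spam_with_link = spam & has_link            # intersection
link_or_prize = has_link | mentions_prize   # union


def probability(event, space=universe):
    """Probability of an event when every email in the space is equally likely."""
    return len(event & space) / len(space)


print("Not spam:", sorted(not_spam))
print("Spam with a link:", sorted(spam_with_link))
print("P(link or prize):", probability(link_or_prize))
```

Each region of a two- or three-circle Venn diagram corresponds to one of these set expressions, so students can check their diagrams against the printed subsets.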
Here is some basic scaffolding to help students complete the activity.
Final Product
What students will submit as the final product of the activity.
A written report defining the sample space, the characteristics of spam and non-spam subsets, and Venn diagrams illustrating the relationships between these sets.
Alignment
How this activity aligns with the learning objectives & standards.
Covers S.CP.A.1 by having students define the sample space and subsets of spam and non-spam emails.
Probability Rule Application
Students will learn and apply the addition and multiplication rules of probability to calculate the likelihood of an email being spam based on identified characteristics such as sender, subject keywords, and the presence of suspicious links.
Steps
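As a possible scaffold, the two rules can be checked numerically on a small labeled sample. The sketch below assumes a hypothetical hand-labeled set of 100 emails; every count is invented for illustration. Exact fractions are used so that students can compare the output with hand calculations.

```python
from fractions import Fraction

# Hypothetical counts from a hand-labeled sample of 100 emails.
total = 100
n_spam = 40            # emails labeled spam
n_link = 30            # emails containing a suspicious link
n_spam_and_link = 24   # emails that are spam AND contain a suspicious link

p_spam = Fraction(n_spam, total)
p_link = Fraction(n_link, total)
p_spam_and_link = Fraction(n_spam_and_link, total)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_spam_or_link = p_spam + p_link - p_spam_and_link

# General multiplication rule: P(A and B) = P(A) * P(B | A)
p_link_given_spam = Fraction(n_spam_and_link, n_spam)
assert p_spam * p_link_given_spam == p_spam_and_link

# The events are independent only if P(A and B) = P(A) * P(B).
independent = p_spam_and_link == p_spam * p_link

print("P(spam or link) =", p_spam_or_link)
print("P(link | spam)  =", p_link_given_spam)
print("Independent?    ", independent)
```

With these made-up counts the two events are not independent, which is a useful prompt for discussing why the simple product P(A) * P(B) would understate how often spam and suspicious links occur together.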
Here is some basic scaffolding to help students complete the activity.
Final Product
What students will submit as the final product of the activity.
A presentation that demonstrates the application of the addition and multiplication rules in determining the probability of an email being spam, along with the calculations and justifications.
Alignment
How this activity aligns with the learning objectives & standards.
Covers S.CP.B.4 by applying the addition and multiplication rules to calculate probabilities of emails being spam or not spam based on identified characteristics.
Conditional Probability Refinement
Students will use conditional probability to improve the accuracy of their spam detection system. They will calculate the probability of an email being spam given certain characteristics and refine their system based on these probabilities.
Steps
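One way to scaffold the refinement, offered as a sketch rather than a prescribed method, is to compute P(spam | characteristic) with Bayes' rule and flag an email when that probability crosses a threshold. The function names, the 0.5 threshold, and all probabilities below are assumptions chosen for illustration.

```python
def p_spam_given_feature(p_feature_given_spam, p_feature_given_legit, p_spam):
    """Bayes' rule: P(spam | feature) from the two likelihoods and the spam base rate."""
    p_legit = 1 - p_spam
    # Total probability of seeing the feature in any email.
    p_feature = p_feature_given_spam * p_spam + p_feature_given_legit * p_legit
    return p_feature_given_spam * p_spam / p_feature


def classify(posterior, threshold=0.5):
    """Label an email as spam when the posterior probability exceeds the threshold."""
    return "spam" if posterior > threshold else "legitimate"


# Hypothetical inputs: 40% of emails are spam; the word "free" appears in
# 60% of spam emails and in 5% of legitimate emails.
posterior = p_spam_given_feature(0.60, 0.05, 0.40)

print(f"P(spam | 'free') = {posterior:.3f}")
print("Decision:", classify(posterior))
```

Raising the threshold makes the system more cautious about flagging legitimate email at the cost of letting more spam through, which gives students a concrete parameter to adjust and report on.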
Here is some basic scaffolding to help students complete the activity.
Final Product
What students will submit as the final product of the activity.
A revised spam detection system algorithm that incorporates conditional probabilities, along with a report explaining the changes and their impact on accuracy.
Alignment
How this activity aligns with the learning objectives & standards.
Covers S.CP.B.4 by calculating conditional probabilities to refine spam detection accuracy based on multiple factors. Also aligns with the learning goal 'Quantify uncertainty in data and make predictions based on statistical analysis.'
Rubric & Reflection
Portfolio Rubric
Grading criteria for assessing the overall project portfolio.
Spam Email Detection System Rubric
Email Universe Definition and Representation
Assesses the student's ability to define the email universe, identify key characteristics, and represent relationships using set theory.
Sample Space Definition
The clarity and accuracy of the sample space definition, including all possible email types.
Exemplary
4 Points
The sample space is comprehensively and accurately defined, including all possible email types with detailed explanations. Examples are well-chosen and highly relevant.
Proficient
3 Points
The sample space is accurately defined, covering most email types. Examples are relevant and clearly illustrate the definition.
Developing
2 Points
The sample space definition is attempted but lacks clarity or completeness. Some email types are omitted, and examples may be unclear.
Beginning
1 Point
The sample space definition is missing or inaccurate. There is a lack of understanding of what constitutes the sample space of emails.
Characteristic Identification and Justification
Identification and justification of characteristics for spam and non-spam emails. Includes why these characteristics are indicative of each category.
Exemplary
4 Points
Exhaustively identifies and justifies characteristics of both spam and non-spam emails with insightful explanations. Justifications are supported by relevant data or examples.
Proficient
3 Points
Clearly identifies and justifies key characteristics of spam and non-spam emails with logical explanations. Justifications are generally sound and appropriate.
Developing
2 Points
Identifies some characteristics of spam and non-spam emails but lacks thorough justification. Explanations may be superficial or missing.
Beginning
1 Point
Struggles to identify relevant characteristics of spam and non-spam emails. Justifications are absent or illogical.
Venn Diagram Representation
Accuracy and effectiveness of Venn diagrams in illustrating relationships between spam and non-spam subsets (unions, intersections, complements).
Exemplary
4 Points
Venn diagrams are exceptionally accurate and insightful, clearly illustrating complex relationships between spam and non-spam subsets. Diagrams enhance understanding significantly.
Proficient
3 Points
Venn diagrams are accurate and effectively illustrate the relationships between spam and non-spam subsets. Diagrams are well-labeled and easy to understand.
Developing
2 Points
Venn diagrams contain inaccuracies or are unclear, making it difficult to understand the relationships between spam and non-spam subsets.
Beginning
1 Point
Venn diagrams are missing or completely inaccurate, failing to illustrate the relationships between spam and non-spam subsets.
Report Quality and Clarity
The overall quality and clarity of the written report, including organization, grammar, and coherence.
Exemplary
4 Points
The report is exceptionally well-written, organized, and coherent. It demonstrates a deep understanding of the concepts and presents them in a clear and compelling manner. Grammar and mechanics are flawless.
Proficient
3 Points
The report is well-written, organized, and coherent. It demonstrates a good understanding of the concepts and presents them clearly. Grammar and mechanics are generally correct.
Developing
2 Points
The report is somewhat disorganized and lacks clarity in places. It demonstrates a basic understanding of the concepts but may contain grammatical errors or inconsistencies.
Beginning
1 Point
The report is poorly written, disorganized, and difficult to understand. It demonstrates a limited understanding of the concepts and contains numerous grammatical errors.
Probability Rule Application and Presentation
Evaluates the student's ability to apply probability rules to classify emails and present their findings effectively.
Understanding of Probability Rules
Accuracy and completeness of research and understanding of the addition and multiplication rules of probability.
Exemplary
4 Points
Demonstrates a comprehensive and nuanced understanding of the addition and multiplication rules with clear explanations and insightful examples beyond the basic definitions.
Proficient
3 Points
Demonstrates a solid and accurate understanding of the addition and multiplication rules, explaining them clearly with relevant examples.
Developing
2 Points
Demonstrates a partial understanding of the addition and multiplication rules, with some inaccuracies or gaps in explanation.
Beginning
1 Point
Demonstrates a limited or inaccurate understanding of the addition and multiplication rules.
Sample Email and Characteristic Selection
Appropriateness and relevance of the chosen sample emails and identified characteristics for applying probability rules.
Exemplary
4 Points
Selects highly appropriate and diverse sample emails, identifying insightful and relevant characteristics that maximize the effectiveness of probability rule application.
Proficient
3 Points
Selects appropriate sample emails and identifies relevant characteristics for applying probability rules effectively.
Developing
2 Points
Selects sample emails and identifies characteristics, but some may be inappropriate or irrelevant for applying probability rules.
Beginning
1 Point
Selects inappropriate sample emails and struggles to identify relevant characteristics for applying probability rules.
Accuracy of Probability Calculations
Accuracy of probability calculations using the multiplication and addition rules.
Exemplary
4 Points
Performs calculations with impeccable accuracy, demonstrating a mastery of the rules and their application to the email samples.
Proficient
3 Points
Performs calculations accurately, demonstrating a good understanding of the rules and their application.
Developing
2 Points
Makes some errors in calculations, indicating a need for further practice and understanding of the rules.
Beginning
1 Point
Makes significant errors in calculations, demonstrating a lack of understanding of the rules and their application.
Presentation Clarity and Effectiveness
Clarity and effectiveness of the presentation in demonstrating the calculations and explaining the application of probability rules.
Exemplary
4 Points
The presentation is exceptionally clear, engaging, and insightful, effectively demonstrating the calculations and explaining the application of probability rules with a high level of understanding.
Proficient
3 Points
The presentation is clear and effective, demonstrating the calculations and explaining the application of probability rules in a well-organized manner.
Developing
2 Points
The presentation is somewhat unclear or disorganized, making it difficult to fully understand the calculations and application of probability rules.
Beginning
1 Point
The presentation is unclear, disorganized, and fails to effectively demonstrate the calculations or explain the application of probability rules.
Conditional Probability and System Refinement
Assesses the student's ability to refine a spam detection system using conditional probability.
Understanding of Conditional Probability
Understanding and explanation of conditional probability and its relevance to spam detection.
Exemplary
4 Points
Demonstrates a sophisticated understanding of conditional probability, explaining its nuances and implications for spam detection with insightful examples and connections to real-world applications.
Proficient
3 Points
Demonstrates a clear and accurate understanding of conditional probability, explaining its relevance to spam detection with appropriate examples.
Developing
2 Points
Demonstrates a partial understanding of conditional probability, with some inaccuracies or gaps in explanation and relevance to spam detection.
Beginning
1 Point
Demonstrates a limited or inaccurate understanding of conditional probability and its relevance to spam detection.
Data Collection Quality and Relevance
Quality and relevance of data collected on email characteristics and spam/non-spam classification.
Exemplary
4 Points
Collects highly relevant and comprehensive data on email characteristics and spam/non-spam classification, ensuring a robust dataset for accurate conditional probability calculations.
Proficient
3 Points
Collects relevant data on email characteristics and spam/non-spam classification, providing a sufficient dataset for conditional probability calculations.
Developing
2 Points
Collects some data on email characteristics and spam/non-spam classification, but the data may be incomplete or lack relevance.
Beginning
1 Point
Collects limited or irrelevant data on email characteristics and spam/non-spam classification, hindering conditional probability calculations.
Accuracy and Incorporation of Calculations
Accuracy of conditional probability calculations and their incorporation into the spam detection system.
Exemplary
4 Points
Performs conditional probability calculations with impeccable accuracy and seamlessly integrates them into the spam detection system, significantly improving its performance.
Proficient
3 Points
Performs conditional probability calculations accurately and incorporates them effectively into the spam detection system, improving its performance.
Developing
2 Points
Makes some errors in conditional probability calculations or struggles to effectively incorporate them into the spam detection system.
Beginning
1 Point
Makes significant errors in conditional probability calculations and fails to effectively incorporate them into the spam detection system.
Testing, Evaluation, and Reporting
Thoroughness of testing, evaluation, and reporting on the revised spam detection system's performance.
Exemplary
4 Points
Conducts rigorous testing and evaluation of the revised system, providing a comprehensive and insightful report that clearly articulates the impact of conditional probabilities on accuracy and overall performance. Includes error analysis and suggestions for future improvements.
Proficient
3 Points
Tests and evaluates the revised system, providing a clear report that explains the impact of conditional probabilities on accuracy and performance.
Developing
2 Points
Provides limited testing and evaluation of the revised system, with a report that lacks detail or clarity regarding the impact of conditional probabilities on accuracy and performance.
Beginning
1 Point
Fails to adequately test and evaluate the revised system, with a report that is incomplete or missing key information.
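To make the testing and error analysis named in this criterion concrete, one option is to compare the system's predictions with hand-assigned labels and count each kind of outcome. The sketch below is an illustration only: the `evaluate` helper, the "spam"/"legitimate" label strings, and the eight sample results are all hypothetical.

```python
def evaluate(actual, predicted):
    """Count outcomes and compute accuracy, precision, and recall for the spam class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == "spam" and p == "spam" for a, p in pairs)              # spam caught
    fp = sum(a == "legitimate" and p == "spam" for a, p in pairs)        # legitimate flagged
    fn = sum(a == "spam" and p == "legitimate" for a, p in pairs)        # spam missed
    tn = sum(a == "legitimate" and p == "legitimate" for a, p in pairs)  # legitimate passed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }


# Hypothetical test set: hand labels versus the system's predictions.
actual = ["spam", "spam", "spam", "legitimate",
          "legitimate", "legitimate", "legitimate", "spam"]
predicted = ["spam", "spam", "legitimate", "legitimate",
             "spam", "legitimate", "legitimate", "spam"]

results = evaluate(actual, predicted)
for name, value in results.items():
    print(f"{name}: {value}")
```

Reporting false positives and false negatives separately matters here because the two errors have different costs: a missed spam email is an annoyance, while a legitimate email sent to the spam folder may never be read.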