Introduction to Machine Learning and Data Mining

Tufts University COMP 135 is designed to give students a comprehensive introduction to machine learning and data mining. The course covers a breadth of topics in machine learning, providing hands-on experience and theoretical understanding.

Course Overview

Machine learning (ML) is the study of algorithmic methods for learning and prediction based on data. Approaches range from extracting patterns from large collections of data to online learning in real-time applications. ML is becoming increasingly widespread thanks to accessible computational power and datasets, along with recent advances in ML algorithms. This course gives a broad introduction to ML and demands significant effort from students.

Ideal candidates are upper-level undergraduates or beginning graduate students comfortable with mathematical techniques and programming. Relevant mathematical knowledge includes statistics, probability, calculus, and linear algebra.

Class Times

  • Tu, Th 10:30AM - 11:45AM
  • Location: Tisch Library, 304-Auditorium

Instructor

Teaching Assistants

  • Sepideh Sadeghi
    • Office Hours: Mon noon-1pm, Fri 10am-noon, Location: Halligan 121
  • Hao Cui
    • Office Hours: Tue 4:30-5:30 pm, Thu 4:30-5:30 pm, Location: Halligan 121

Grading

  • Written homework assignments (20%)
  • Quizzes (20%)
  • In-class midterm exam (20%): March 17
  • Final project (40%)
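
As a rough illustration of how the weights above combine, the sketch below computes a weighted course score in Python. The function name `course_score`, the 0-100 scale for each component, and the sample scores are assumptions made for illustration; they are not part of the official grading policy.

```python
# Weights taken from the grading list above; they sum to 1.0.
WEIGHTS = {
    "homework": 0.20,
    "quizzes": 0.20,
    "midterm": 0.20,
    "project": 0.40,
}

def course_score(scores):
    """Return the weighted average of component scores (each assumed 0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical scores, for illustration only.
example = {"homework": 90, "quizzes": 80, "midterm": 70, "project": 85}
print(course_score(example))
```

Because the final project carries twice the weight of any other component, a ten-point change in the project score moves the total by four points, while the same change in any other component moves it by two.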

Rules for Late Submissions

All work must be turned in on the date specified. Notify Kyle Harrington of special circumstances at least two days in advance.

Collaboration Policy

  • Homework assignments and projects: Discussion about problems and concepts is encouraged, but each assignment must be completed individually.
  • Quizzes and exams: No collaboration allowed.

Tentative List of Topics

  • Supervised learning: nearest neighbors, decision trees, linear classifiers, Bayesian classifiers; feature processing and selection; avoiding overfitting; experimental evaluation.
  • Unsupervised learning: clustering algorithms; generative probabilistic models; the EM algorithm; association rules.
  • Theory: basic PAC analysis for classification.
  • More supervised learning: neural networks; backpropagation; dual perceptron; kernel methods; support vector machines.
  • Additional topics: active learning; aggregation methods; time series models; reinforcement learning.
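
To give a flavor of the first topic in the list, here is a minimal sketch of a k-nearest-neighbor classifier in plain Python. It is not course material: the function name `knn_predict`, the choice of Euclidean distance, the default `k=3`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    if k < 1 or k > len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration: two well-separated groups in 2D.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

The method stores the training data and defers all computation to prediction time, which keeps the sketch short but makes each query cost time proportional to the size of the training set.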

Reference Material

  • Primary Text: Machine Learning. Tom M. Mitchell, McGraw-Hill, 1997
  • Introduction to Machine Learning, Ethem Alpaydin, 2010.
  • Data Mining: Practical Machine Learning Tools and Techniques. Ian H. Witten, Eibe Frank, 2005.

Programming and Software

  • Weka: A machine learning package used for some assignments.
  • Languages: Python, Java, Julia, Matlab, Clojure, R
  • Jupyter: A notebook-based programming environment used for in-class demos and assignments.

Assignments, Quizzes, and Exams

Assignment 1

  • Download Weka
  • Open Weka and choose Explorer
  • Load the dataset with “Open file…”

Submission: Write a one-paragraph description of findings, including visualizations and clustering results.

Assignment 2 (Bonus)

  • Git and GitHub: Clone the repository, complete the assignment in a Jupyter Notebook, and submit a pull request.

Final Projects

  • Proposals Due: March 7
  • Project Due: May 5

Submission: A detailed project report in the form of an 8-12 page paper.

Resources

  • Faculty resources: Ask around for interesting datasets.
  • Google Scholar: Search for articles published in “ICML”, “NIPS”, or “Machine Learning”.
  • Datasets: A comprehensive list of datasets for machine learning projects.

Course Content

Detailed slides, assignments, project guidelines, and further resources can be found in the course GitHub repository.