Machine Learning Field Notes

Jan Piotrzkowski

Relationship between AI, ML, and Data Science

“The exciting new effort to make computers think … machines with minds, in the full and literal sense.” - McCarthy et al., 1955

  • Artificial Intelligence (AI) covers every attempt to make machines behave intelligently - understanding language, recognizing images, making decisions.
  • Machine Learning (ML) lives inside AI. It focuses on algorithms that learn patterns from data rather than being programmed rule by rule.
  • Data Science blends statistics, computation, and domain knowledge to extract insights from data - ML is one of the primary tools in that toolkit.

Picture AI as the umbrella, ML as the collection of learning methods, and Data Science as the applied discipline using those methods on real data.

What can Machine Learning do?

ML helps answer two complementary questions:

  • Prediction - “What will happen?” (e.g., will it rain tomorrow?)
  • Inference - “Why or how is it happening?” (e.g., why does it rain?)

Better inferences improve predictions, and confident predictions guide which inferences to explore next.

What is a Machine Learning Model?

“A machine learning model is a program that can find patterns or make decisions from a previously unseen dataset.” - What Are Machine Learning Models?, n.d.

“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” - Mitchell, 1997

During training, algorithms digest data. The resulting model captures what was learned so it can make predictions or decisions on fresh inputs.

Types of Machine Learning

Three families cover most practical work:

  1. Supervised learning - learn mappings from inputs (features) to outputs (labels) when labeled data is available.
  2. Unsupervised learning - explore unlabeled data to uncover structure such as clusters or latent representations.
  3. Reinforcement learning - let an agent interact with an environment, receive rewards or penalties, and optimize for long-term reward.

Machine Learning Workflow

“Machine learning workflow development is anecdotally regarded to be an iterative process of trial-and-error with humans-in-the-loop.” - Xin et al., 2018

  1. Collect and prepare data - gather raw sources (images, text, sensor readings), clean errors/duplicates, and structure everything for analysis.
  2. Feature extraction and selection - identify relevant attributes (pixel intensities, word frequencies), transform raw data, and reduce to the most informative signals when needed.
  3. Split data - separate training and testing (validation) sets so evaluation reflects unseen data.
  4. Train the model - choose an algorithm (decision tree, neural network, SVM) and adjust parameters to learn from the training set.
  5. Evaluate - test on unseen data with metrics suited to the problem to confirm the model learned transferable patterns.
  6. Tune and improve - refine features, gather more data, tune hyperparameters, or change algorithms based on evaluation results.
  7. Deploy and monitor - ship the model and continue monitoring, because data drift eventually requires retraining.
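Steps 3 through 5 above can be sketched end to end in a few lines of Python. Everything here is a toy illustration: the six hand-made examples, the 1-nearest-neighbour "model" (which simply memorises the training set), and the split sizes are all assumptions chosen for brevity, not a recommended implementation.

```python
import random

# 1./2. Toy prepared dataset: (feature vector, label) pairs -- made-up numbers.
data = [([1.0, 1.2], "A"), ([0.9, 1.1], "A"), ([1.1, 0.9], "A"),
        ([3.0, 3.2], "B"), ([2.9, 3.1], "B"), ([3.1, 2.8], "B")]

# 3. Split data into training and testing sets.
random.seed(0)
random.shuffle(data)
train, test = data[:4], data[2:]  # last two examples are held out
train, test = data[:4], data[4:]

# 4. "Train": a 1-nearest-neighbour model just memorises the training set.
def predict(x):
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda row: dist(row[0], x))[1]

# 5. Evaluate on the unseen test set.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"accuracy = {accuracy:.2f}")
```

Because the two toy clusters are well separated, any split that keeps at least one example of each class in the training set yields perfect accuracy here; real evaluation is rarely this forgiving.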

Supervised Learning

  • Features are attribute-value pairs describing an example (e.g., “color is blue”). - Provost, 1998
  • Labels are the expected outputs the model should learn (e.g., “diabetic” or “not diabetic”).
  • Classifiers map measured features to the label. - Loog, 2017

The learning loop builds a function f(features) -> label. Disease detection uses health metrics as features and the diagnosis as the label. Microscopy workflows use morphological descriptors as features and the known cell type as the label.
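As a deliberately simplified sketch of f(features) -> label, the disease-detection example might look like this. The glucose threshold and the two patient records are illustrative assumptions; a real classifier would learn its decision rule from data rather than hard-code it.

```python
# Hypothetical feature dictionaries for the disease-detection example.
patients = [
    {"glucose": 180, "bmi": 33},
    {"glucose": 95,  "bmi": 22},
]
labels = ["diabetic", "not diabetic"]  # the expected outputs (labels)

# A classifier is any function f(features) -> label. Here a hand-written
# threshold stands in for a learned decision rule.
def f(features):
    return "diabetic" if features["glucose"] > 126 else "not diabetic"

predictions = [f(p) for p in patients]
```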

Classification vs. Regression

  • Classification sorts inputs into discrete categories (benign vs. malignant, protein family A vs. B).
  • Regression predicts continuous values (blood sugar level, binding affinity).

Both use the supervised framework. They simply optimize different loss functions and metrics.
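The difference in loss functions can be made concrete with two tiny hand-made examples (all numbers below are hypothetical): zero-one loss / accuracy for discrete labels, mean squared error for continuous targets.

```python
# Classification: discrete categories, scored with e.g. accuracy (1 - zero-one loss).
y_true_cls = ["benign", "malignant", "benign"]
y_pred_cls = ["benign", "benign", "benign"]
accuracy = sum(t == p for t, p in zip(y_true_cls, y_pred_cls)) / len(y_true_cls)

# Regression: continuous values, scored with e.g. mean squared error.
y_true_reg = [5.4, 6.1, 4.9]   # hypothetical blood-sugar readings
y_pred_reg = [5.0, 6.0, 5.0]
mse = sum((t - p) ** 2 for t, p in zip(y_true_reg, y_pred_reg)) / len(y_true_reg)
```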

Unsupervised Learning

When datasets lack labels, unsupervised learning uncovers structure:

  • Clustering - group similar samples (gene expression profiles, single-cell RNA sequences) to identify novel cell types or patient cohorts.
  • Dimensionality reduction - compress thousands of measurements into a handful of axes so patterns become visible and computation stays manageable.
  • Anomaly detection - flag rare events, such as an unusual mutation that explains a unique therapeutic response.
  • Association rule learning - mine “if-then” relationships to generate hypotheses (e.g., certain chemical motifs correlate with better drug performance).
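A minimal k-means clustering sketch shows the idea behind grouping similar samples. This uses 1-D points and naive initialisation purely for brevity; real work on expression profiles would use a library implementation and far more careful setup.

```python
# Minimal k-means: alternate between assigning points to the nearest centroid
# and moving each centroid to the mean of its assigned points.
def kmeans(points, k, iters=10):
    centroids = points[:k]                      # naive initialisation
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups in a toy list of "expression levels" (made-up values).
centroids, clusters = kmeans([1.0, 1.1, 0.9, 8.0, 8.2, 7.8], k=2)
```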

Reinforcement Learning

Reinforcement learning (RL) handles situations where an agent interacts with an environment and receives rewards or penalties. Examples include robots that learn to move through trial-and-error and drug-design agents that generate molecule candidates, receiving rewards when a structure looks promising for a target protein. - Sutton & Barto, 2018
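The agent-environment loop can be sketched with tabular Q-learning on a toy chain environment (states 0 to 4, reward only at the goal state). The environment, the hyperparameters, and the episode count are all assumptions for illustration; RL for problems like drug design uses far richer state and reward definitions.

```python
import random

# Tabular Q-learning on a 5-state chain: move left (-1) or right (+1),
# receive reward 1.0 only upon reaching the goal state 4.
random.seed(0)
n_states, goal = 5, 4
actions = [-1, 1]
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != goal:
        # Epsilon-greedy: usually exploit the best known action, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), n_states - 1)   # environment transition
        r = 1.0 if s2 == goal else 0.0          # reward signal
        # Q-learning update: nudge Q toward reward plus discounted future value.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

# After training, the greedy policy should step right from every state.
policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in range(goal)}
```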

Evaluation Metrics

  • Choose metrics that reflect the cost of errors: precision/recall or AUROC when false positives and false negatives matter differently.
  • Track calibration whenever probabilities are shown to humans.
  • Monitor drift metrics (population stability index, KL divergence) to know when retraining is required.

Treat these guardrails like any other SOP so ML outputs stay auditable.
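Precision and recall fall straight out of the confusion counts. The labels below are a made-up screening example where missed positives (false negatives) are the costly error:

```python
# 1 = positive case, 0 = negative case (hypothetical screening results).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)   # of flagged cases, how many were real?
recall = tp / (tp + fn)      # of real cases, how many were caught?
```

A high-recall, lower-precision operating point suits screening; the reverse suits settings where each false alarm is expensive.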

References

  • John McCarthy, Marvin Minsky, Nathaniel Rochester, Claude Shannon. A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, 1955. Link
  • What Are Machine Learning Models? Microsoft Learn, n.d. Link
  • Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997. Link
  • Qian Xin et al. “Restructuring Conversations: A Machine Learning Workflow for Data Scientists.” CHI EA, 2018. Link
  • Foster Provost. Knowledge Discovery from Data for Business Applications. NYU Stern, 1998. Link
  • Marco Loog. Supervised Classification: Quite a Brief Overview. Delft University of Technology, 2017. Link
  • Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning. Springer, 2009. Link
  • Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018. Link