STATS216v Introduction to Statistical Learning
Stanford University, Summer 2018
Problem Set 1
Due: Friday, July 6
Remember the university honor code. All work and answers must be your own.
1. Explain whether each scenario below is a regression, classification or unsupervised
learning problem. If it is a supervised learning scenario, indicate whether we are more
interested in inference or prediction. Finally, provide in each case the number of observations, n, and the number of predictors, p.
(a) Oil excavation is a very expensive process and the oil resources are not distributed
uniformly in an area, so it is important to find the best spots for oil extraction. To
do this, engineers consider a very coarse grid (each edge length is on the order of
miles), dig a well at each vertex of the grid and take a sample of the sand there.
24 different measurements are taken from each sand sample. An engineer has sand
samples for 35 locations where they know the results of the digging (how much
oil was present at that location). Additionally, the engineer has sand samples for
80 prospective well locations, and wishes to find the most promising spot to dig a
future well.
B to each customer on the basis of collected customer demographics (age, zip code,
and gender). A set of 300 of its customers have already expressed a preference for
(c) A policy analyst is interested in discovering factors that are associated with the
unemployment rate across different U.S. cities. For each of 400 cities, the policy
analyst gathers the following data: the population, state, average income, crime
rate, percentage of students who graduate high school and unemployment level.
(d) Stanford received 42,000 undergraduate applications in the year 2017. The application includes the following data for each applicant: age, high school GPA, scores in
the SAT Critical Reading, SAT Math and SAT Writing exams, and whether they
are domestic or international. The university wishes to understand the different
subtypes of students in the application pool.
(e) A neuroscientist wishes to develop a tool that can identify the type of cells based
on a few measurements. Each cell is one of three types: glial cell, motor neuron
cell, or horizontal cell. The neuroscientist has 68 labeled cells, each with three measurements available: the number of branch points, the number of active processes,
and the average process length.
2. You are a data science consultant! In each of the following cases, decide whether you
would suggest a flexible regression model or an inflexible one. Provide your reasons as
clearly as possible.
(a) In the study of breast cancer, a scientist is trying to find the genes associated with
breast cancer. The total number of genes in the study is 50,000 and the number
of patients is 120.
(b) The Ministry of Education in a certain country wants to identify students who
need extra help. They wish to design a system which estimates student performance in the final 8th grade math exam based on their math, science and history
grades in the 7th grade. To do this, they want to run a regression on the data
from all the students who have graduated from the 8th grade in the last 10 years.
(c) Kelly is a very hardworking chemistry student and she has run an experiment
to find a mathematical expression that relates the speed of corrosion of iron to
the humidity and temperature of the environment, and the percentage of different elements in the alloy. Unfortunately, the lab that she is working in was
established in 1967 and the equipment has not been changed since then. This has
caused measurements to vary significantly between different experimental runs,
even when the parameters were the same. She is skeptical about the quality of
her measurements of the speed of corrosion.
(d) Kelly’s advisor won the Nobel prize in chemistry and used the prize money to
outfit the lab with the most modern equipment. Kelly ran her experiments again
with the new equipment and now she can trust her numbers. However, her advisor
believes that she should not expect the true relationship to be linear.
3. For each of parts (a) through (e), indicate whether you would…

