Doing Data Science

I’ve been reading an excellent book by Cathy O’Neil and Rachel Schutt, “Doing Data Science: Straight Talk from the Frontline” (O’Reilly, 2013). The book contains many practical examples of data science, often accompanied by R code. The technical detail is valuable, but it is the balance and insight of the commentary that really struck me. A few of the things that resonated with me:

  • data science is special and distinct from statistics because a data product (e.g., a recommendation system) “gets incorporated back into the real world, and users interact with that product, and that generates more data, which creates a feedback loop” (p. 42). In other words, data models can be strongly ‘performative’: they shape the behaviour they set out to describe.
  • drawing on Josh Wills’ definition: “data scientist (noun): Person who is better at statistics than any software engineer and better at software engineering than any statistician”
  • data science isn’t just about models and algorithms: “They spend a lot more time trying to get data into shape than anyone cares to admit, maybe up to 90% of their time” (p. 351). Getting data into shape involves data acquisition, data structuring, data cleaning (outliers, missing values, etc.) and transformations. Many statistics courses, by contrast, start with fairly clean and well-structured data, spending 90% of the time on models and only 10% on herding data.
  • data scientists need to be able to ask questions, be able to say “I don’t know” and not be blinded by money: “They seek out opportunities to solve problems of social value and they try to consider the consequences of their models” (p. 355).
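
To make the “getting data into shape” point concrete, here is a minimal sketch of the steps listed above: structuring, handling missing values, flagging outliers and transforming. It is my own illustration, not code from the book (whose examples are in R), and uses only the Python standard library. The records, the field names (`user`, `spend`), the helper names and the outlier cutoff are all made up for the example.

```python
import math
import statistics

# Hypothetical raw records, as they might arrive from a scraped CSV:
# every value is a string, and some are missing or malformed.
raw_rows = [
    {"user": "a", "spend": "12.50"},
    {"user": "b", "spend": ""},
    {"user": "c", "spend": "9.75"},
    {"user": "d", "spend": "n/a"},
    {"user": "e", "spend": "11.00"},
    {"user": "f", "spend": "950.00"},
    {"user": "g", "spend": "10.25"},
]


def parse_spend(text):
    """Structuring: turn a raw string into a float, or None if unusable."""
    try:
        return float(text)
    except ValueError:
        return None


def clean(rows, cutoff=5.0):
    """Impute missing values, flag outliers and add a log-transformed column.

    `cutoff` is an arbitrary assumption: a value is flagged when it lies more
    than `cutoff` median absolute deviations (MADs) away from the median.
    """
    parsed = [parse_spend(row["spend"]) for row in rows]
    observed = [value for value in parsed if value is not None]

    # Missing values: impute with the median of the observed values.
    median = statistics.median(observed)

    # Outliers: use a robust spread estimate, so that one extreme value
    # does not inflate the threshold it is being compared against.
    mad = statistics.median(abs(value - median) for value in observed)

    cleaned = []
    for row, value in zip(rows, parsed):
        imputed = value is None
        spend = median if imputed else value
        cleaned.append({
            "user": row["user"],
            "spend": spend,
            "imputed": imputed,
            "outlier": mad > 0 and abs(spend - median) / mad > cutoff,
            # Transformation: log(1 + x) compresses the long right tail.
            "log_spend": math.log1p(spend),
        })
    return cleaned


cleaned = clean(raw_rows)
```

Even in this toy case the modelling step has not started yet, and every choice made along the way (median imputation, the cutoff, flagging rather than dropping outliers) is a judgment call that a clean textbook dataset would have hidden.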
