Is practical ML all about data preparation?

So… I just watched this talk by Andrej Karpathy (the head of AI at Tesla).

The talk is two years old, but I think it’s still very good. One slide in particular caught my attention tho:

I’m a master’s student, and I wonder if any of you with practical experience can confirm whether this is really true?

Hey the_donald (that username tho),

This is basically true. The reason: in a production environment you use tried-and-tested methods and models. Since most of the work on those methods has already been done, there isn’t much left to do on the modeling side. For real-world applications, the data is the most important part. The real world is chaotic, and something unexpected is always popping up that isn’t accounted for in your training data. Those gaps bias the model’s predictions, which leads to problems and errors.
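To make the “unexpected inputs” point concrete, here’s a minimal, hypothetical sketch of the kind of data sanity check production pipelines lean on. The function name, the sensor scenario, and the thresholds are all invented for illustration:

```python
# Hypothetical sketch: flag inputs that fall outside the range the
# training data covered, before they ever reach the model.

def find_outliers(values, low, high):
    """Return indices of values outside the expected [low, high] range."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]

# Say the training data covered temperatures from -10 to 40 C.
# A sensor glitch produces 999.0 and -40.0, which the model never saw.
readings = [12.5, 18.3, 999.0, 22.1, -40.0]
bad = find_outliers(readings, low=-10.0, high=40.0)
print(bad)  # → [2, 4]
```

In practice this is where most of the production effort goes: monitoring, validating, and patching the data the model sees, rather than touching the model itself.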

PhDs, on the other hand, mostly focus on novel research ideas for methods and models, using pre-curated, research-oriented datasets that are already cleaned and filtered, with established metrics and benchmarks. So there’s not much to do on the data side.

Think of adding good data to an ML model in a production environment as swapping a traditional hard drive for an SSD: it’s the single biggest improvement you can make. You can also change hyperparameters and tweak model architectures for performance gains, but those changes are more like adding RAM or a faster CPU. It’s an upgrade, and you might feel a slight speed bump, but it’s nothing like the difference between a hard drive and a solid-state drive.

Hope that analogy worked for you.