Data Science 101: Data Agnosticism

Bits are bits. Whether you are searching for whales in audio clips or trying to predict hospitalization rates based on insurance claims, the process is the same: clean the data, generate features, build a model, and iterate. Better features lead to a better model, but without domain expertise it is often difficult to extract those features.

NumPy/SciPy, Matplotlib, pandas, and scikit-learn provide an excellent framework for data analysis and feature discovery. This is evidenced by high-performing models in the Heritage Health Prize and the Marinexplore Right Whale Detection Challenge. In both competitions, the largest performance gains came from identifying better features, which requires being able to repeatedly visualize and characterize model successes and failures. Python provides this capability, as well as the ability to rapidly implement and test new features.
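To make the "generate features, build a model, and iterate" loop concrete, here is a minimal sketch using NumPy and scikit-learn. It is not taken from either competition: the data is synthetic, and the label rule, the `cv_accuracy` helper, and the choice of logistic regression are all assumptions made purely for illustration. The sketch compares a model trained on two raw columns with the same model after one derived feature (their product) is added.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the label is 1 when the
# two raw columns share the same sign, which no linear boundary can capture.
X_raw = rng.normal(size=(1000, 2))
y = (X_raw[:, 0] * X_raw[:, 1] > 0).astype(int)


def cv_accuracy(X, y):
    """Hypothetical helper: mean 5-fold cross-validated accuracy."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()


# Step 1: baseline model on the raw columns only.
baseline = cv_accuracy(X_raw, y)

# Step 2: add a derived feature (the product of the two columns) and refit.
X_eng = np.column_stack([X_raw, X_raw[:, 0] * X_raw[:, 1]])
engineered = cv_accuracy(X_eng, y)

print(f"raw features only:    {baseline:.3f}")
print(f"with derived feature: {engineered:.3f}")
```

In this toy setup the model itself never changes; only the features do. In practice, the candidate feature would come from inspecting where the baseline model fails rather than from knowing the label rule in advance.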

The video below is a presentation from the SciPy 2013 conference on how Python was used to develop competitive predictive models from derived features discovered through data analysis: feature engineering without domain expertise.

 

 
