Over the holiday I spent some time making progress on moving one of my machine learning projects to Spark. An important piece of the project is a data transformation library with a set of pre-defined functions. The original implementation uses a
pandas dataframe and runs on a single machine. As our data has grown much bigger, sometimes we have to use a giant Azure VM with 512GB of memory, and it still takes a long time to run the entire transformation; otherwise I have to chunk the data and transform it in batches (which is not a good idea for column-based transformations such as feature normalization, since those need statistics computed over the whole column). Another blocking issue is that the intermediate memory consumption can be really high, up to 10x the original data size.
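To make the chunking problem concrete, here is a minimal sketch (toy data, not from the actual project) of why per-batch normalization goes wrong: z-score normalization needs the mean and standard deviation of the entire column, and computing them per chunk gives each batch its own scale.

```python
import pandas as pd

# A toy feature column; the real data would be a full dataframe.
values = pd.Series([1.0, 2.0, 3.0, 100.0, 200.0, 300.0])

# Normalizing the whole column at once uses global statistics.
global_norm = (values - values.mean()) / values.std()

# Normalizing chunk by chunk uses per-chunk statistics, so the same
# raw value maps to a different scale depending on its batch.
chunks = [values[:3], values[3:]]
chunked_norm = pd.concat([(c - c.mean()) / c.std() for c in chunks])

print(global_norm.tolist())   # one consistent scale across all rows
print(chunked_norm.tolist())  # e.g. 2.0 and 200.0 both normalize to 0.0
```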
So I decided to give Spark a try, since I will not have to move the data around once it is in Azure Blob Storage.
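As a rough sketch of what that looks like, Spark can read the data in place through the `wasbs://` connector; the account, container, and file names below are hypothetical, and the cluster needs the hadoop-azure connector on its classpath.

```python
from pyspark.sql import SparkSession

# Hypothetical storage account / container names, for illustration only.
spark = (
    SparkSession.builder
    .appName("feature-transform")
    .config(
        "spark.hadoop.fs.azure.account.key.myaccount.blob.core.windows.net",
        "<storage-account-key>",
    )
    .getOrCreate()
)

# Spark reads straight from Blob Storage, so nothing has to be
# copied to the cluster before the transformation runs.
df = spark.read.csv(
    "wasbs://mycontainer@myaccount.blob.core.windows.net/input/data.csv",
    header=True,
    inferSchema=True,
)
df.show(5)
```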