With the adoption of Hadoop across the software industry, data scientists are in high demand. It is well known that people with a data science background struggle to apply data science to big data because they lack big data knowledge, while people with a programming background struggle just as much because they lack data science knowledge. These are two different groups of people whose end goal, “Machine Learning on Big Data”, is the same. We try to bridge this gap and give you the right steps to get started.

Where to start with Machine Learning?

We followed the “Stanford Machine Learning” course on Coursera, which covers all the basic machine learning algorithms in great depth. In the course, the algorithms are implemented in Octave, a language similar to Matlab. Octave alone is not enough to get started with big data, because it is limited to in-memory computation on a single machine. We needed the same algorithms for big data, so we decided to solve the same exercises on Spark. Our team at Zinnia Systems is sharing this work through our GitHub repository, which contains all the exercises of the course ported from Octave to Spark. We hope that comparing the Octave and Spark code side by side will speed up your learning of machine learning on big data.
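To give a feel for what porting an exercise involves, here is a minimal sketch in plain Python (no Spark required) of batch gradient descent for one-variable linear regression. It is written as a map step over the data points followed by a reduce step that sums the per-point gradients, which is the shape that carries over to a distributed setting. The function names, the toy data set, and the learning rate are our own illustrative assumptions, not code taken from the repository.

```python
from functools import reduce

def point_gradient(theta, point):
    """Gradient contribution of one (x, y) point for h(x) = theta0 + theta1 * x."""
    theta0, theta1 = theta
    x, y = point
    error = (theta0 + theta1 * x) - y
    return (error, error * x)

def add_pairs(a, b):
    """Element-wise sum of two 2-tuples (the reduce step)."""
    return (a[0] + b[0], a[1] + b[1])

def gradient_descent(points, alpha=0.1, iterations=2000):
    """Batch gradient descent; every iteration makes a full pass over the data."""
    theta = (0.0, 0.0)
    m = len(points)
    for _ in range(iterations):
        # map: compute each point's gradient; reduce: sum them over all points
        grads = map(lambda p: point_gradient(theta, p), points)
        g0, g1 = reduce(add_pairs, grads)
        theta = (theta[0] - alpha * g0 / m, theta[1] - alpha * g1 / m)
    return theta

# Hypothetical toy data lying exactly on the line y = 1 + 2x
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
theta0, theta1 = gradient_descent(data)
print(theta0, theta1)
```

Because every iteration scans the whole data set again, keeping that data in memory between passes matters a great deal once it no longer fits on one machine.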

Why Spark?

Spark’s in-memory processing and emphasis on iterative processing make it well suited for Machine Learning

With the advent of Spark, the gap between data scientists and the programming community on big data started closing quickly. What keeps Spark ahead of other big-data frameworks is its basic design: a strong emphasis on iterative algorithms and in-memory processing, which underpins its headline claim of being “100x faster than Hadoop”. When we started working with Spark, we became very fond of RDDs (Resilient Distributed Datasets), the distributed immutable collections that serve as the framework’s basic abstraction. Spark also provides higher-level operations such as joins, group-by, and reduce-by, which are very handy for building complex model data flows without writing explicit iteration. By now you are probably aware that machine learning on Spark is one of the most active areas in big data. So our team at Zinnia Systems started doing machine learning on Spark.
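To make the reduce-by operation concrete, the following plain-Python sketch mimics what reducing by key does to a collection of (key, value) pairs: values that share a key are combined with a function you supply. This is only a single-machine illustration of the semantics. The helper name `reduce_by_key` and the sample data are hypothetical, and this is not Spark’s API.

```python
def reduce_by_key(pairs, combine):
    """Combine the values of all pairs that share a key, using `combine`."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = combine(result[key], value)
        else:
            result[key] = value
    return result

# Hypothetical sample data: count how often each word occurs
words = ["spark", "hadoop", "spark", "mllib", "spark", "hadoop"]
pairs = [(word, 1) for word in words]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```

In a distributed framework the same combine function can be applied to partial results from different machines, which is why it should be associative.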

Machine Learning Library (MLLIB)

MLLIB is the fastest growing machine learning community with more than 137 active contributors

Everybody who starts using machine learning in Spark ends up at MLLIB. In fact, MLLIB is the fastest growing machine learning community, with more than 137 active contributors, and it has already pushed Apache Mahout into the back seat in developing machine learning algorithms. MLLIB already includes algorithms such as Linear Regression and Logistic Regression using Stochastic Gradient Descent, K-Means, Recommender Systems, and Linear Support Vector Machines. MLLIB is part of a larger machine learning project, MLBASE, which includes an API for feature extraction and an optimizer. Implementing these algorithms yourself is not good practice when active contributors are already publishing implementations, and only after testing them under adequate conditions. This is where our team comes in: we want to help you use MLLIB from day one. Through our repo we give you basic programs for using all the algorithms in MLLIB in one shot. It will be a very interesting journey, and you may be amazed at how the nightmare of machine learning on big data eases once you start with Spark. We hope you are excited to start this venture!

As you may know, Zinnia Systems is an active contributor to open source, and this is our next step in the same direction. Come, try it out, and build your knowledge in areas you may never have explored before.
