Visiting Apache Spark as a noob

Spark has become the buzzword of the data science industry. From data analysts to data scientists, everyone is talking about it, and even the experts recommend it! Amazon, Yahoo and eBay all employ Apache Spark in their projects. But what, really, is Spark? Do you need it? Are you a noob struggling to understand the basics? We’ve got you covered!

Apache Spark is an open-source cluster computing framework used for real-time analysis of big data. Donated to the Apache Software Foundation by its developers at the University of California, Berkeley, Spark has evolved into one of Apache’s star projects in recent years.

We have compiled the hipster’s guide to Apache Spark, introducing the basics and the necessary advanced topics here:

Who needs Real Time Analytics?

Every minute, Facebook registers millions of likes, Twitter sees hundreds of thousands of tweets, Reddit tallies over 18,000 votes, and 300 hours of video are uploaded to YouTube! This data, colossal amounts of it, can be used for the benefit of mankind in a number of ways. However, storing, handling and manipulating such huge amounts of data is not easy. Until recently, Hadoop was the tool for such tasks. Spark, however, is now a cheaper, faster and better mechanism for handling such big data analytics.

Real-time analysis in healthcare helps doctors track patients’ medicine consumption and lets hospitals coordinate with one another when blood or organs are needed. National defense forces use it to track threats and stay in touch with the government. Corporations use it to keep customers happy and reduce churn rates. The banking sector uses it to detect fraud. The stock market depends on predictions from models built on real-time analysis. There is hardly a field where real-time analytics would not be of use, and Spark simplifies the processes involved.

But why Spark – we had Hadoop!

Hadoop, the dominant distributed data processing system before Spark, can only perform batch processing using MapReduce. This means Hadoop analyzes data that has already been stored rather than fresh, real-time data. One may argue that Storm and Impala were handling the job well too, but Apache Storm and S4 could only do stream processing, Impala and Tez could handle only interactive processing, and Neo4j and Apache Giraph could process only graphs.

The need was strongly felt for a powerful engine that could handle both batch and stream processing while giving sub-second responses and performing in-memory processing. That is when Matei Zaharia came up with Spark, which can handle graph, interactive, batch and stream processing alike, all while being fast, effective and easy to use.

Components of Apache Spark

What goes into making Spark faster and better than Hadoop? Its main components:

  1. Spark Core: This is the Spark kernel, the base engine that underlies all Spark applications.
  2. Spark SQL: This lets you run SQL queries on Spark. Using Spark SQL one can process structured as well as semi-structured data, and it provides an engine on which unmodified Hive queries can run up to 100 times faster than on Hadoop’s MapReduce. A short example follows this list.
  3. Spark Streaming: This lets you process live data streams on Spark. Incoming streams are split into micro-batches that run on top of the core Spark engine.
  4. Spark MLlib: A data science favorite, MLlib provides efficient, powerful machine learning algorithms. In-memory data processing improves performance dramatically, and for many people this is the top reason to choose Apache Spark.
  5. Spark GraphX: One can process graphs using this engine.
  6. SparkR: An R package that lets data scientists work with Spark from the familiar R environment, combining the usability of R with the scalability of Spark.
  7. Resilient Distributed Datasets (RDDs): The fundamental unit of data in Apache Spark. An RDD is an immutable collection of elements partitioned across the nodes of a cluster, which can be operated on in parallel; new RDDs are formed by transforming old ones. You can create RDDs in a few ways: call the parallelize method on an existing collection in the driver program, call the textFile method with the path or URL of a text file, or transform an existing RDD. A sketch of all three follows this list.
  8. Spark Shell: Imagine a powerful interactive shell on top of Spark; that’s the Spark Shell for you! It gives you a command line (Scala or Python) where you can explore data and try out the APIs interactively.
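As a taste of Spark SQL (item 2 above), here is a minimal sketch you can type into the Spark Shell, where the SparkSession is already available as spark. The file people.json and its fields are placeholders of our own: any JSON file with one record per line will do.

// people.json is assumed to contain one JSON record per line, e.g. {"name":"Ada","age":36}
val people = spark.read.json("people.json")

// Register the DataFrame as a temporary view so it can be queried with plain SQL
people.createOrReplaceTempView("people")

// Run a SQL query on top of Spark and print the result
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()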
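And to make the three ways of creating RDDs (item 7) concrete, here is another small Spark Shell sketch, where the SparkContext is available as sc. The path data/sample.txt is a placeholder for any text file on your machine.

// 1. Parallelize an existing collection from the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Load an external text file into an RDD
val lines = sc.textFile("data/sample.txt")

// 3. Transform existing RDDs into new ones; the originals stay immutable
val doubled = numbers.map(_ * 2)
val longLines = lines.filter(_.length > 20)

// Actions such as collect and count trigger the actual computation
println(doubled.collect().mkString(", "))
println(longLines.count())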

If you have Java and Scala installed, it’s time to launch into Spark. Head to http://spark.apache.org/downloads.html to download the latest Spark version. Open your terminal and extract the downloaded tarball (the file name below matches Spark 2.1.0; adjust it to the version you downloaded):

tar -xvf spark-2.1.0-bin-hadoop2.7.tgz

Set the path using the following commands:

export SPARK_HOME=Path_Where_Spark_Is_Installed

export PATH=$PATH:$SPARK_HOME/bin
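Once the path is set, you can quickly check the setup by launching the interactive Spark Shell from the same terminal:

spark-shell

If everything is in place, you should land at a Scala prompt with the SparkContext already available as sc (and, in Spark 2.x, the SparkSession as spark).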

Once this is done, you are free to start using Spark. You can begin with a simple real-life example like an earthquake detection system. Spark is polyglot, so its high-level APIs are available from several languages, and you can even integrate Hadoop ecosystem components.
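To give a flavour of what a real-time pipeline such as an earthquake detector is built on, here is a minimal Spark Streaming sketch for the Spark Shell, not the full system. It assumes a plain text socket on localhost:9999 standing in for a real data feed, and the keyword filter is purely illustrative.

// Inside spark-shell, reusing the existing SparkContext sc
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batches of five seconds, as described in the Spark Streaming section above
val ssc = new StreamingContext(sc, Seconds(5))

// A plain text socket stands in for a real feed such as a Twitter stream
val messages = ssc.socketTextStream("localhost", 9999)

// Keep only messages that mention an earthquake and print a few per batch
val alerts = messages.filter(_.toLowerCase.contains("earthquake"))
alerts.print()

// Start the stream; stop it later with ssc.stop(stopSparkContext = false)
ssc.start()

While it runs, you can feed it test lines from another terminal with netcat (nc -lk 9999) and watch matching messages appear with each five-second batch.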

The easy interface provided by Spark is a blessing: one need not learn new languages or tools to work with Apache Spark. Spark itself is built on Scala, but you can also use Java, Python and R to work with it. Twitter sentiment analysis, NBA game predictions and earthquake detection systems can be your starting points. The more you practice, the better you get!

Happy Data Processing!
