Hands-On Big Data Analytics with PySpark

What is parallelization?

The best way to understand Spark, or any framework, is to look at the documentation. If we look at Spark's documentation, it clearly states that the textFile function we used last time reads a text file from HDFS (or any other Hadoop-supported file system) and returns it as an RDD of strings.

On the other hand, if we look at the definition of parallelize, we can see that it creates an RDD by distributing a local collection, a Scala collection in the Scala API, or a Python collection such as a list in PySpark.

So, the main difference between using parallelize to create an RDD and using textFile to create an RDD is where the data is sourced from.
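To make this contrast concrete, the following is a minimal sketch of both approaches; the file path is only a placeholder for wherever your data actually lives, not a file used in this chapter:

# Read a text file into an RDD; each line becomes one element.
# The path is purely illustrative; it could be a local path or an HDFS URI.
file_rdd = sc.textFile("data/sample.txt")

# Distribute a collection that already exists in the driver's memory.
local_rdd = sc.parallelize([1, 2, 3, 4, 5])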

Let's look at how this works practically. Let's go back to the PySpark session from where we left off previously. We imported urllib, used urllib.request to retrieve some data from the internet, and used SparkContext and textFile to load this data into Spark. The other way to do this is to use parallelize.

Let's look at how we can do this. Let's first assume that our data is already in Python, so, for demonstration purposes, we are going to create a Python range of a hundred numbers as follows:

a = range(100)
a

This gives us the following output:

range(0, 100)

If we look at a, it is simply a range object covering 100 numbers. If we convert it into a list, it will show us all 100 numbers:

list(a)

This gives us the following output:

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
...

The following command shows us how to turn this into an RDD:

list_rdd = sc.parallelize(a)

If we look at what list_rdd contains, we can see that it is PythonRDD.scala:52, which tells us that the Scala-backed PySpark instance has recognized this as a Python-created RDD, as follows:

list_rdd

This gives us the following output:

PythonRDD[3] at RDD at PythonRDD.scala:52

Now, let's look at what we can do with this list. The first thing we can do is count how many elements are present in list_rdd by using the following command:

list_rdd.count()

This gives us the following output:

100

We can see that list_rdd contains 100 elements. Notice that, because Spark has to schedule the job and iterate over the RDD, this call is noticeably slower than simply taking the length of a, which returns instantly.

The RDD call takes some time because it needs to go through the parallelized version of the list. At a small scale, where there are only a hundred numbers, this trade-off might not be very helpful, but with larger amounts of data, and larger individual elements, distributing the work makes a lot more sense.
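If we want to see how Spark splits the work up, parallelize accepts an optional second argument that sets the number of partitions. The following sketch is not part of the original example; the choice of four partitions is arbitrary and only for illustration:

# Ask Spark to split the collection into 4 partitions.
partitioned_rdd = sc.parallelize(a, 4)
partitioned_rdd.getNumPartitions()   # 4
partitioned_rdd.glom().first()       # the list of elements held in the first partition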

We can also take an arbitrary number of elements from the RDD, as follows:

list_rdd.take(10)

This gives us the following output:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

When we run the preceding command, we can see that PySpark has performed some calculations before returning the first ten elements of the list. Notice that all of this is now backed by PySpark, and we are using Spark's power to manipulate this list of 100 items.
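Because the list is now an RDD, we can also chain other transformations before pulling results back to the driver. The following is just a small sketch to illustrate the idea, not part of the original example:

# Keep the even numbers, square them, and take the first five results.
list_rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).take(5)

This would give us the following output:

[0, 4, 16, 36, 64]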

Let's now use the reduce function on list_rdd (it is available on RDDs in general) to demonstrate what we can do with PySpark's RDDs. We will pass a two-parameter function, written as an anonymous lambda function, to the reduce call, as follows:

list_rdd.reduce(lambda a, b: a+b)

Here, the lambda takes two parameters, a and b. It simply adds these two numbers together, hence a+b, and returns the result. The RDD reduce call adds the first two numbers of the RDD together, then adds the third number to that running result, and so on, until all 100 numbers have been combined into a single value.

Now, after working through the distributed collection, we can see that adding the numbers from 0 to 99 gives us 4950, and it is all done using PySpark's RDD methodology. You might recognize this function from the term MapReduce, and, indeed, it is the same idea.
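As a quick sanity check (not part of the original example), we can compare the distributed result with plain Python's built-in sum over the same range:

sum(a)                                         # 4950, computed locally in Python
list_rdd.reduce(lambda a, b: a + b) == sum(a)  # True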

We have just learned what parallelization is in PySpark, and how parallelize gives us another way to create RDDs, which is very useful. Now, let's look at some basic RDD operations.