01. Which of the following DataFrame methods is classified as a transformation?
a) DataFrame.count()
b) DataFrame.show()
c) DataFrame.select()
d) DataFrame.foreach()
e) DataFrame.first()
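For reference: select() is a transformation, so it is lazy and only extends the query plan, while count(), show(), first(), and foreach() are actions that trigger execution. A minimal PySpark sketch, assuming a SparkSession named spark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
    df = spark.range(5)           # DataFrame with a single 'id' column

    projected = df.select("id")   # transformation: returns a new DataFrame, runs no job
    print(projected.count())      # action: triggers a Spark job and returns 5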
02. If we want to create a constant integer 1 as a new column 'new_column' in a DataFrame df, which code block should we select?
a) df.withColumnRenamed('new_column', lit(1))
b) df.withColumn(new_column, lit(1))
c) df.withColumn("new_column", lit("1"))
d) df.withColumn("new_column", 1)
e) df.withColumn("new_column", lit(1))
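For reference, lit() wraps a Python literal in a Column object, which withColumn() requires as its second argument. A minimal sketch, assuming a DataFrame df already exists:

    from pyspark.sql.functions import lit

    df = df.withColumn("new_column", lit(1))  # constant integer 1 on every row
    df.printSchema()  # new_column is an integer; lit("1") would have made it a string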
03. Which three of the following DataFrame operations are classified as actions?
(Choose 3 answers)
a) printSchema()
b) show()
c) first()
d) limit()
e) foreach()
f) cache()
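For reference, cache() is itself lazy: it only marks the DataFrame for caching, and the data is materialized the first time an action runs. A minimal sketch, assuming df exists:

    df.cache()    # lazy: nothing is computed yet
    df.count()    # action: runs a job and populates the cache
    df.first()    # action: can now read from the cached data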
04. The code block displayed below contains an error. The code block is intended to join DataFrame itemsDf with the larger DataFrame transactionsDf on column itemId. Find the error.
Code block: transactionsDf.join(itemsDf, "itemId", how="broadcast")
a) The syntax is wrong, how= should be removed from the code block.
b) The join method should be replaced by the broadcast method.
c) Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.
d) The larger DataFrame transactionsDf is being broadcast, rather than the smaller DataFrame itemsDf.
e) broadcast is not a valid join type.
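For reference, a broadcast join is requested with the broadcast() hint on the smaller DataFrame, combined with an ordinary join type. A minimal sketch, assuming both DataFrames share an itemId column:

    from pyspark.sql.functions import broadcast

    # Hint Spark to ship the small itemsDf to every executor; "inner" is a valid join type.
    joined = transactionsDf.join(broadcast(itemsDf), "itemId", how="inner")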
05. If Spark is running in client mode, which of the following statements is correct?
a) The Spark driver is randomly assigned to a machine in the cluster.
b) The Spark driver is assigned to the machine that has the most resources.
c) The Spark driver remains on the client machine that submitted the application.
d) The entire Spark application is run on a single machine.
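For reference, the deploy mode is chosen when the application is submitted, not in the application code; the app.py file name below is just an illustration:

    # Run from a shell, not from Python:
    #   spark-submit --deploy-mode client  app.py   # driver stays on the submitting machine
    #   spark-submit --deploy-mode cluster app.py   # driver runs on a machine inside the cluster
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("deploy-mode-demo").getOrCreate()
    print(spark.sparkContext.deployMode)  # prints "client" or "cluster"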
06. Which command can we use to get the number of partitions of a DataFrame named df?
a) df.rdd.getPartitionSize()
b) df.getPartitionSize()
c) df.getNumPartitions()
d) df.rdd.getNumPartitions()
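For reference, getNumPartitions() is defined on the underlying RDD, not on the DataFrame itself. A minimal sketch, assuming a SparkSession named spark:

    df = spark.range(100)
    print(df.rdd.getNumPartitions())                  # depends on defaults, e.g. 8
    print(df.repartition(4).rdd.getNumPartitions())   # 4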
07. Which of the following are valid execution modes?
a) Kubernetes, Local, Client
b) Client, Cluster, Local
c) Server, Standalone, Client
d) Cluster, Server, Local
e) Standalone, Client, Cluster
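For reference, local mode runs the driver and executors in a single JVM on one machine, which is convenient for trying these questions out. A minimal sketch:

    from pyspark.sql import SparkSession

    # local[*] = local execution mode using all cores of the current machine.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("local-mode-demo")
             .getOrCreate())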
08. The code block below intends to join df1 with df2 using an inner join, but it contains an error. Identify the error.
Code block: df1.join(df2, "inner", df1.col("id") === df2.col("id"))
a) The join type is not in the right position. The correct query should be df2.join(df1, df1.col("id") === df2.col("id"), "inner")
b) There should be == instead of ===. So the correct query is df1.join(df2, "inner", df1.col("id") == df2.col("id"))
c) The syntax is not correct. It should be df1.join(df2, df1.col("id") == df2.col("id"), "inner")
d) We cannot do an inner join in Spark 3.0, but it is on the roadmap.
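For reference, the join condition comes before the join type, and in PySpark column equality uses == (=== belongs to the Scala API; PySpark DataFrames also have no .col() method). A minimal sketch, assuming df1 and df2 each have an id column:

    joined = df1.join(df2, df1["id"] == df2["id"], "inner")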
09. Which of the following statements is NOT true for broadcast variables?
a) It provides a mutable variable that a Spark cluster can safely update on a per-row basis.
b) It is a way of updating a value inside of a variety of transformations and propagating that value to the driver node in an efficient and fault-tolerant way.
c) You can define your own custom broadcast class by extending org.apache.spark.util.BroadcastV2 in Java or Scala or pyspark.AccumulatorParams in Python.
d) Broadcast variables are shared, immutable variables that are cached on every machine in the cluster instead of serialized with every single task.
e) The canonical use case is to pass around a large lookup table that fits in memory on the executors.
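For reference, a broadcast variable is created once on the driver and only read by tasks. A minimal sketch, assuming a SparkContext named sc:

    lookup = sc.broadcast({"a": 1, "b": 2})   # immutable lookup table, cached on each executor
    rdd = sc.parallelize(["a", "b", "a"])
    print(rdd.map(lambda k: lookup.value[k]).collect())   # [1, 2, 1]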
10. Which of the following code blocks adds a column predErrorSqrt to DataFrame transactionsDf that is the square root of column predError?
a) transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))
b) transactionsDf.withColumn("predErrorSqrt", sqrt(predError))
c) transactionsDf.select(sqrt(predError))
d) transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())
e) transactionsDf.select(sqrt("predError"))
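For reference, sqrt() from pyspark.sql.functions accepts either a column name or a Column object, while the Column class has no .sqrt() method. A minimal sketch:

    from pyspark.sql.functions import sqrt, col

    transactionsDf = transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))
    # equivalent: transactionsDf.withColumn("predErrorSqrt", sqrt("predError"))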