Use a global Map to manage Spark DataFrame aliases
Pain points
When writing a Spark application, we can set an alias on a DataFrame, but we cannot read a DataFrame's alias back. This is because when we call df.alias(), Spark creates an org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias, which is a logical plan node. In other words, the alias is not a field stored on the DataFrame; it exists only as part of the logical plan that is executed at runtime.
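You can see this for yourself with a quick experiment along these lines (a minimal sketch assuming a local session and throwaway data): the alias shows up only as a plan node, and the Dataset API exposes no getter for it.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "linux")).toDF("id", "os").alias("an_alias")

// The alias is visible only as a SubqueryAlias node in the logical plan:
println(df.queryExecution.logical)
// SubqueryAlias an_alias
// +- ... (the underlying relation)

// There is no getter on df to read "an_alias" back out.
```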
If we set an alias on a DataFrame in one function and then want to use that alias in another function, we have to declare a shared field so that both functions can see the same alias string. This leads to a code smell, as the example below shows.
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def painPoint(): Unit = {
  // 'theAlias' has to live outside the functions so that both can see it
  val theAlias = "an_alias"

  def function1(df: DataFrame): DataFrame = {
    df.alias(theAlias)
  }

  def function2(df: DataFrame): DataFrame = {
    df.join(getOsAndPrice.alias("osAndPrice"), Seq("id"), "left_outer")
      .drop(col(s"$theAlias.os")) // Here we have to keep the 'theAlias' field
    // drop(s"$theAlias.os") doesn't work; we will see the root cause later
  }
}
```
So we can use two Maps to implement this manager. To avoid concurrency issues, we use ConcurrentHashMap to store the mapping relationship, and we wrap the get and set operations in synchronized blocks so that each one is atomic.
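As a minimal sketch of that idea (the object and method names below are my own, not from any Spark API), the manager keeps two ConcurrentHashMaps, one per lookup direction, and guards the compound operations with synchronized:

```scala
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.sql.DataFrame

object DataFrameAliasManager {
  // Two directions: DataFrame -> alias, and alias -> DataFrame.
  // ConcurrentHashMap keeps individual reads/writes thread-safe.
  private val dfToAlias = new ConcurrentHashMap[DataFrame, String]()
  private val aliasToDf = new ConcurrentHashMap[String, DataFrame]()

  // synchronized makes the "alias then record in both maps" sequence atomic
  def setAlias(df: DataFrame, alias: String): DataFrame = synchronized {
    val aliased = df.alias(alias)
    dfToAlias.put(aliased, alias)
    aliasToDf.put(alias, aliased)
    aliased
  }

  def getAlias(df: DataFrame): Option[String] = synchronized {
    Option(dfToAlias.get(df))
  }

  def getDataFrame(alias: String): Option[DataFrame] = synchronized {
    Option(aliasToDf.get(alias))
  }
}
```

With such a manager, function2 from the pain-point example could look the alias up via something like DataFrameAliasManager.getAlias(df) instead of relying on a shared theAlias field.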