mGrowTech

Drilling into Spark’s ALS Recommendation algorithm

by Josh
June 9, 2025
in AI, Analytics and Automation

  • February 25, 2017
  • Vasilis Vryniotis

The ALS algorithm introduced by Hu et al. is a very popular technique for Recommender System problems, especially when we have implicit datasets (for example clicks, likes, etc.). It handles large volumes of data reasonably well, and good implementations exist in various Machine Learning frameworks. Spark includes the algorithm in its MLlib component, which has recently been refactored to improve the readability and the architecture of the code.

Spark’s implementation requires the Item and User ids to be numbers within the integer range (either of Integer type, or Long within the integer range), which is reasonable as this can help speed up the operations and reduce memory consumption. One thing I noticed while reading the code, though, is that those id columns are cast into Doubles and then into Integers at the beginning of the fit/predict methods. This seems a bit hacky, and I’ve seen it put unnecessary strain on the garbage collector. Here are the lines in the ALS code that cast the ids into doubles:

To understand why this is done, one needs to read the checkedCast():

This UDF receives a Double, checks its range, and then casts it to an integer. It is used for schema validation. The question is: can we achieve this without the ugly casts to Double? I believe so:


  protected val checkedCast = udf { (n: Any) =>
    n match {
      case v: Int => v // Avoid unnecessary casting
      case v: Number =>
        val intV = v.intValue()
        // True for Byte/Short, Long within the Int range and Double/Float with no fractional part.
        if (v.doubleValue == intV) {
          intV
        }
        else {
          throw new IllegalArgumentException(s"ALS only supports values in Integer range " +
            s"for columns ${$(userCol)} and ${$(itemCol)}. Value $n was out of Integer range.")
        }
      case _ => throw new IllegalArgumentException(s"ALS only supports values in Integer range " +
        s"for columns ${$(userCol)} and ${$(itemCol)}. Value $n is not numeric.")
    }
  }

The code above shows a modified checkedCast() which receives the input, asserts that the value is numeric and within the integer range, and raises an exception otherwise. Since the input is Any, we can safely remove all the cast-to-Double statements from the rest of the code. Moreover, since ALS requires ids within the integer range, it is reasonable to expect that the majority of people actually use integer types. As a result, on line 3 the method handles Integers explicitly to avoid any casting. For all other numeric values it checks whether the input is within the integer range. This check happens on line 7.

One could write this differently and explicitly handle each of the permitted types. Unfortunately, this would lead to duplicate code. Instead, what I do here is convert the number into an Integer and compare it with the original Number. If the values are identical, one of the following is true:

  1. The value is Byte or Short.
  2. The value is Long but within the Integer range.
  3. The value is Double or Float but without any fractional part.
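The three cases above can be tried out without Spark. What follows is a minimal sketch in plain Scala, assuming we only want to observe the comparison trick in isolation; the object name CheckedCastSketch and the function toIntChecked are hypothetical and are not part of Spark, and the error messages are simplified.

```scala
object CheckedCastSketch {
  // Hypothetical stand-alone helper (not part of Spark) that mirrors the
  // "convert to Int, then compare with the original Number" trick.
  def toIntChecked(n: Any): Int = n match {
    case v: Int => v // already an Int, nothing to check
    case v: Number =>
      val intV = v.intValue()
      // Equal only for Byte/Short, Long within the Int range,
      // and Double/Float without a fractional part.
      if (v.doubleValue() == intV) intV
      else throw new IllegalArgumentException(
        s"Value $n is out of Integer range or has a fractional part.")
    case _ =>
      throw new IllegalArgumentException(s"Value $n is not numeric.")
  }

  def main(args: Array[String]): Unit = {
    // A mix of legal and illegal inputs.
    val samples: Seq[Any] =
      Seq(7, 7.toByte, 7.toShort, 7L, 7.0, 7.0f, 7.5, Int.MaxValue + 1L, "7")
    samples.foreach { s =>
      val outcome = scala.util.Try(toIntChecked(s))
      println(s"$s (${s.getClass.getSimpleName}) -> $outcome")
    }
  }
}
```

Running main should report a Success for the first six samples and a Failure for 7.5, for the Long just above Int.MaxValue, and for the String.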

To ensure that the code runs well I tested it with the standard unit-tests of Spark and manually by checking the behavior of the method for various legal and illegal values. To ensure that the solution is at least as fast as the original, I tested numerous times using the snippet below. This can be placed in the ALSSuite class in Spark:



  test("Speed difference") {
    val (training, test) =
      genExplicitTestData(numUsers = 200, numItems = 400, rank = 2, noiseStd = 0.01)

    val runs = 100
    var totalTime = 0.0
    println("Performing "+runs+" runs")
    for(i <- 0 until runs) {
      val t0 = System.currentTimeMillis
      testALS(training, test, maxIter = 1, rank = 2, regParam = 0.01, targetRMSE = 0.1)
      val secs = (System.currentTimeMillis - t0)/1000.0
      println("Run "+i+" executed in "+secs+"s")
      totalTime += secs
    }
    println("AVG Execution Time: "+(totalTime/runs)+"s")

  }

After a few tests we can see that the new fix is slightly faster than the original:

Code       Number of Runs   Total Execution Time   Average Execution Time per Run
Original   100              588.458s               5.88458s
Fixed      100              566.722s               5.66722s
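To put the two totals in perspective, here is a small sketch of the arithmetic in plain Scala. The numbers are copied from the table; the object name SpeedupSketch and the helper relativeImprovementPct are hypothetical names used only for illustration.

```scala
object SpeedupSketch {
  // Totals copied from the table (seconds, 100 runs each).
  val originalTotal = 588.458
  val fixedTotal = 566.722
  val runs = 100

  // Relative improvement of `after` over `before`, as a percentage.
  def relativeImprovementPct(before: Double, after: Double): Double =
    (before - after) / before * 100.0

  def main(args: Array[String]): Unit = {
    val savedPerRun = (originalTotal - fixedTotal) / runs
    println(f"Saved per run: $savedPerRun%.5f s")
    println(f"Relative improvement: ${relativeImprovementPct(originalTotal, fixedTotal)}%.2f%%")
  }
}
```

On these figures the fix saves roughly 0.22s per run, which is a bit under 4% of the original run time.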

 

I repeated the experiments multiple times to confirm, and the results are consistent. Here you can find the detailed output of one experiment for the original code and the fix. The difference is small for a tiny dataset, but in the past I’ve managed to achieve a noticeable reduction in GC overhead using this fix. We can confirm this by running Spark locally and attaching a Java profiler to the Spark instance. I opened a ticket and a Pull-Request on the official Spark repo; because it was uncertain whether it would be merged, I decided to share it here with you. The fix is now part of Spark 2.2.

Any thoughts, comments or criticism are welcome! 🙂


