
New open-source Machine Learning Framework written in Java

by Josh
June 16, 2025
  • October 19, 2014
  • Vasilis Vryniotis

I am happy to announce that the Datumbox Machine Learning Framework is now open-sourced under GPL 3.0, and you can download its code from GitHub!

What is this Framework?

The Datumbox Machine Learning Framework is an open-source framework written in Java which enables the rapid development of Machine Learning models and Statistical applications. It is the code that currently powers the Datumbox API. The main focus of the framework is to include a large number of machine learning algorithms and statistical methods and to handle small to medium-sized datasets. Even though the framework aims to assist the development of models from various fields, it also provides tools that are particularly useful in Natural Language Processing and Text Analysis applications.

What types of models/algorithms are supported?

The framework is divided into several layers: Machine Learning, Statistics, Mathematics, Algorithms and Utilities. Each of them provides a series of classes that are used for training machine learning models. The two most important layers are the Statistics layer and the Machine Learning layer.

The Statistics layer provides classes for calculating descriptive statistics, performing various types of sampling, estimating CDFs and PDFs from commonly used probability distributions and performing over 35 parametric and non-parametric tests. Such classes are usually necessary when performing exploratory data analysis, sampling and feature selection.
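To give a feel for the kind of calculation this layer covers, here is a minimal plain-Java sketch of a few descriptive statistics and one closed-form CDF. It does not use the Datumbox classes: the class name `DescriptiveStatsSketch` and all of its methods are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; NOT part of the Datumbox API.
final class DescriptiveStatsSketch {

    /** Arithmetic mean of the sample. */
    static double mean(double[] xs) {
        if (xs.length == 0) {
            throw new IllegalArgumentException("empty sample");
        }
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Unbiased sample variance (divides by n - 1). */
    static double variance(double[] xs) {
        if (xs.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double m = mean(xs);
        double sumOfSquares = 0.0;
        for (double x : xs) {
            sumOfSquares += (x - m) * (x - m);
        }
        return sumOfSquares / (xs.length - 1);
    }

    /** Median, computed on a sorted copy so the input array is left untouched. */
    static double median(double[] xs) {
        if (xs.length == 0) {
            throw new IllegalArgumentException("empty sample");
        }
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** CDF of the exponential distribution with the given rate: P(X <= x). */
    static double exponentialCdf(double x, double rate) {
        if (rate <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        return x < 0 ? 0.0 : 1.0 - Math.exp(-rate * x);
    }

    public static void main(String[] args) {
        double[] sample = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("mean     = " + mean(sample));
        System.out.println("variance = " + variance(sample));
        System.out.println("median   = " + median(sample));
        System.out.println("P(X <= 1) for Exp(2) = " + exponentialCdf(1.0, 2.0));
    }
}
```

The framework's own classes presumably cover far more cases (weighted samples, other distributions, hypothesis tests); the JUnit tests in the repository remain the authoritative usage examples.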

The Machine Learning layer provides classes that can be used in a large number of problems including Classification, Regression, Cluster Analysis, Topic Modeling, Dimensionality Reduction, Feature Selection, Ensemble Learning and Recommender Systems. Here are some of the supported algorithms: LDA, Max Entropy, Naive Bayes, SVM, Bootstrap Aggregating, Adaboost, Kmeans, Hierarchical Clustering, Dirichlet Process Mixture Models, Softmax Regression, Ordinal Regression, Linear Regression, Stepwise Regression, PCA and more.
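For readers unfamiliar with the algorithms in that list, the following is a minimal sketch of one of them, a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing, in plain Java. It is an assumption-laden illustration, not the Datumbox implementation: the class `NaiveBayesSketch`, its methods and the whitespace tokenizer are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical class for illustration only; NOT the Datumbox implementation or API.
final class NaiveBayesSketch {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one labelled document to the training counts. */
    void train(String label, String text) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            wordsPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log-posterior, or null if untrained. */
    String predict(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        String[] tokens = tokenize(text);
        for (String label : docsPerClass.keySet()) {
            // Log prior: fraction of training documents carrying this label.
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int totalWords = wordsPerClass.getOrDefault(label, 0);
            for (String token : tokens) {
                // Add-one smoothing keeps unseen words from zeroing the product.
                int count = counts.getOrDefault(token, 0);
                score += Math.log((count + 1.0) / (totalWords + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    /** Naive tokenizer: lower-case, split on whitespace. */
    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("positive", "great product love it");
        nb.train("positive", "love this great service");
        nb.train("negative", "terrible product hate it");
        nb.train("negative", "hate this awful service");
        System.out.println(nb.predict("love great"));
        System.out.println(nb.predict("awful terrible hate"));
    }
}
```

Working in log space avoids numeric underflow when documents contain many tokens; a production implementation would also need proper tokenization and feature selection.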

Datumbox Framework VS Mahout VS Scikit-Learn

Both Mahout and Scikit-Learn are great projects, but they have completely different targets. Mahout supports only a very limited number of algorithms, those which can be parallelized and thus use Hadoop’s Map-Reduce framework to handle Big Data. Scikit-Learn, on the other hand, supports a large number of algorithms but can’t handle huge amounts of data. Moreover, it is developed in Python, which is a great language for prototyping and Scientific Computing but not my personal favourite for software development.

The Datumbox Framework sits in the middle of the two solutions. It tries to support a large number of algorithms and it is written in Java. This means that it can be incorporated more easily into production code, it can be tweaked more easily to reduce memory consumption and it can be used in real-time systems. Finally, even though the Datumbox Framework is currently capable of handling only medium-sized datasets, I plan to expand it to handle large datasets.

How stable is it?

The early versions of the framework (up to 0.3.x) were developed in August and September of 2013 and they were written in PHP (yep!). During May and June 2014 (versions 0.4.x), the framework was rewritten in Java and enhanced with additional features. Both branches were heavily tested in commercial applications including the Datumbox API. The current version is 0.5.0 and it seems mature enough to be released as the first public alpha version of the framework. Having said that, it is important to note that some functionalities of the framework are tested more thoroughly than others. Moreover, since this version is alpha, you should expect drastic changes in future releases.

Why did I write it and why am I open-sourcing it?

My involvement with Machine Learning and NLP dates back to 2009 when I co-founded WebSEOAnalytics.com. Since then I have been developing implementations of various machine learning algorithms for various projects and applications. Unfortunately most of the original implementations were very problem-specific and could hardly be used for any other problem. In August 2013 I decided to start Datumbox as a personal project and develop a framework that provides the tools for developing machine learning models, focusing on the area of NLP and Text Classification. My target was to build a framework that could be reused in the future to quickly develop machine learning models, to incorporate into projects that require machine learning components, or to offer as a service (Machine Learning as a Service).

And here I am now, several lines of code later, open-sourcing the project. Why? The honest answer is that at this point, it is not within my plans to go through a “let’s build a new start-up” journey. At the same time I felt that keeping the code on my hard disk in case I need it in the future does not make sense. So the only logical thing to do was to open-source it. 🙂

Documentation?

If you read the previous two paragraphs, you probably saw this coming. Since the framework was not developed with sharing in mind, the documentation is poor to non-existent. Most of the classes and public methods are not properly commented and there is no document describing the architecture of the code. Fortunately all the class names are self-explanatory, and the framework provides JUnit tests for every public method and algorithm; these can be used as examples of how to use the code. I hope that with the help of the community we will build proper documentation, so I am counting on you!

Current Limitations and Future Development

As with every piece of software (and especially open-source projects in alpha version), the Datumbox Machine Learning Framework comes with its own unique and adorable limitations. Let’s dig into them:

  1. Documentation: As mentioned earlier, the documentation is poor.
  2. No Multithreading: Unfortunately the framework does not currently support Multithreading. Of course we should note that not all machine learning algorithms can be parallelized.
  3. Code Examples: Since the framework has just been published, you can’t find any code examples on the web other than those provided by the framework in the form of JUnit tests.
  4. Code Structure: Creating a solid architecture for any large project is always challenging, let alone when you have to deal with Machine Learning algorithms that differ significantly (supervised learning, unsupervised learning, dimensionality reduction algorithms etc.).
  5. Model Persistence and Large Data Collections: Currently the models can be trained and stored either in files on disk or in MongoDB databases. To be able to handle large amounts of data, other solutions must be investigated. For example, MapDB seems like a good candidate for storing data and parameters while training. Moreover, it is important to remove any 3rd party libraries that currently handle the persistence of the models and develop a better, DRY and modular solution.
  6. New algorithms/tests/models: There are so many great techniques that are not currently supported (especially for time series analysis).
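To make the file-based persistence mentioned in point 5 more concrete, here is a minimal sketch of saving and reloading learned parameters with standard Java serialization. This is only an assumption about what such storage could look like; the post does not describe how the framework actually writes its models, and `ModelPersistenceSketch`, `ModelParameters` and their members are hypothetical names.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the framework's real persistence classes are not shown in this post.
final class ModelPersistenceSketch {

    /** Learned parameters bundled into one serializable object. */
    static final class ModelParameters implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<String, Double> weights = new HashMap<>();
        double intercept;
    }

    /** Writes the model to the given file using Java object serialization. */
    static void save(ModelParameters model, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(model);
        }
    }

    /** Reads a model previously written by save(). */
    static ModelParameters load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (ModelParameters) in.readObject();
        }
    }

    /** Demo helper: saves to a temporary file, loads it back, then cleans up. */
    static ModelParameters roundTrip(ModelParameters model) {
        try {
            Path file = Files.createTempFile("model", ".ser");
            try {
                save(model, file);
                return load(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ModelParameters model = new ModelParameters();
        model.intercept = 0.5;
        model.weights.put("price", -1.2);
        ModelParameters restored = roundTrip(model);
        System.out.println("intercept = " + restored.intercept);
        System.out.println("weights   = " + restored.weights);
    }
}
```

Serializing the whole parameter object at once is simple but requires the model to fit in memory, which is one reason an on-disk store such as the MapDB option mentioned above would be worth investigating for larger datasets.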

Unfortunately all of the above is a lot of work and there is so little time. That is why, if you are interested in the project, I ask you to step forward and give me a hand with any of the above. Moreover, I would love to hear from people who have experience in open-sourcing medium to large projects and could provide tips on how to manage them. Additionally, I would be grateful to any brave soul who would dare to look into the code and document some classes or public methods. Last but not least, if you use the framework for anything interesting, please drop me a line or share it in a blog post.

 

Finally I would like to thank my love Kyriaki for tolerating me while writing this project, my friend and super-ninja-Java-developer Eleftherios Bampaletakis for helping out with important Java issues and you for getting involved in the project. I’m looking forward to your comments.


