The Husky project aims to build a more expressive and most importantly, more efficient system for distributed data analytics.
Download Husky (core components)**This academic version provides basic but essential features such as fine-grained object interaction, coarse-grained transformation (e.g., map, reduce), iterative graph analytics and asynchronous machine learning. Husky is open-source in GitHub.
Husky has a highly optimized backend. We always try to exploit more performance from limited computing resources.
Besides, the Husky computing model allows more efficient algorithms to be programmed.
Husky supports a variety of applications, including MapReduce-style ETL, text mining, graph analytics, machine learning(both synchronous and asynchronous).
Furthermore, Husky can cooperate seamlessly with many existing systems, and integrate nicely in the Hadoop eco-system. We are working to support more external systems.
Husky exposes an easy-to-use interface, allowing complicated logic to be expressed succinctly.
Husky can easily work with many existing data stores. With the Python/Scala frontends, even non-technical people can get started with Husky quickly, and enjoy all the benefits.
Example 1: Use PyHusky (Husky Python API) to extract a graph from a big json file, compute its PageRank, and get the top k nodes:
import json
import bindings.frontend as ph
import bindings.frontend.library.graph as hg
edgelist = ph.env.load("/path/to/json") \
.map(lambda line:json.loads(line)) \
.map(lambda doc:(doc['src'], doc['dst']))
graph = hg.from_edgelist(edgelist)
graph.pagerank(iter=10)
graph.pagerank_topk(k=2)
Example 2: Use ScHusky (Husky Scala API) to solve collaborative filtering with Asynchronous SGD:
import schusky.api._
import schusky.lib.ml._
import play.api.libs.json._
val env = new Environment()
val data = env.mongodb("mongohost", 9000).load("db.collection").map { doc =>
val js = Json.parse(doc)
((js\"user").as[Int], (js\"item").as[Int], (js\"rating").as[Double])
}
val model = Homad.from(Matrix.fromTriplet(data))
model.setReg(0.01).setLearningRate(0.1).mfasync(time=20)
println(model.getRMSE)