About distributed system, programming and system level stuff.

Reactive Programming - Revisiting Abstraction

Reactive Programming 在現代基於事件驅動程式設計及架構來講,根本上來講以去除副作用 (side-effect) 的 declarative 方式來建構事件的轉換及組合,可以有效降低在 concurrency 下的錯誤和增強組合性 (composability)。這衍伸在工業界如 ReactiveX (RxJava, RxJS, etc)、Reactive Stream Specification 或是如 Future 的建構都有其影子。 然而,他的定義從各個出處仍十分模糊且難以讓人理解,例如: Reactive Programming is a programming with asynchronous data stream. [1] Reactive programming is a declarative programming paradigm concerned with data streams and the propagation of change. [2] Reactive programming is a programming paradigm that is built around the notion of continuous time-varying values and propagation of change. [3] 在加上網路上基於各種感悟和體會的文章衍伸的不嚴謹考究,讓 Reactive Programming 逐漸成為 buzzword。

Note - MBrace: Cloud Computing with Monads

Programming large-scale distributed systems is difficult and requires expert programmers orchestrating concurrent processes across various physical places (nodes), and potentially required to be scalable, resilient, etc. Choosing a well abstraction framework can reduce such efforts and make system performant and resilient. For example, MapReduce Akka CloudHaskell[1] and HdpH[2] However, problematics on MapReduce model: less expressive and not suitable for streaming, iterative and incremental algorithms. (NOTE: the motivation is less persuasive, IMHO)

Local Mac OSX Setup for Engineering Data

紀錄本地端 Mac 對 Hadoop Ecosystem 相關開發的設定。 Hadoop Homebrew 安裝 hadoop 起來我覺得蠻 stable 的,版本也沒問題,以本地開發來講算足夠了。 以 brew 安裝後,會有安裝的 location 和方便使用的 script (start-all.sh) 的位置是需要 tuning 的,基本上設定為 Pseudo Distributed 的模式。 /usr/local/Cellar/hadoop/<version>/libexec/etc/hadoop: configuration files (xml files),設定參考官網即可。 /usr/local/sbin: symlink 的 start-up, tear-down scripts。 所以基本上每次要重開的時候都 refer 到 /usr/local/sbin 找 script 來把 daemon 打開 (e.g. hdfs, yarn, etc.) 然後預設的 Web UI 介面的位置為: Name Node: http://localhost:9870/ Resource Manager: http://localhost:8088/ Spark Spark 基本上本地開發就直接 Link Library 即可,沒特別需要把 build 好的 package 撈下來,以 test driven 的方式 + interactive SBT 是最佳模式。 需要注意的點為有些版本 (Scala or JVM) 會不相容,注意一下主線的需求版本。

Google Summer of Code

This is a post to identify what I’ve done in Google Summer of Code 2018. It’s summarized in brief and non-technical, you can refer to more detail in the following links. Links to what I’ve done: In-detail post about OpenWhisk performance improvement experiment. Main repo/branch for experiment. Invoker agent repo/branch for experiment. Extended multi-actions wrk peformance bench. Commits on OpenWhisk main repo during GSoC progress. The Journey At first, my original GSoC proposal: OpenWhisk performance improvement - work stealing, priority-based scheduling on load balancer and direct connection for streaming capabilities.

OpenWhisk Performance Improvement

OpenWhisk community is nowadays getting more consistent on the new design of architecture for performance improvement. The future architecture of OpenWhisk requires large internal breaking changes. To fill the gap from idea into smooth migration, there might be helpful if a mid ground exist to clear more issues. Hence, I’ve worked on prototyping performance improvement in real, but not that comprehensive though. However, hope this prototype hold a place for deeper discussion and discover all issues might meet in the future.

Distributed System Collections

My personal favor distributed system resources. The maintenance guideline for these resources is keeping it to be small, concise and efficient when reviewing it in the future. Lamport Clock / Vector Clock COS 418 - Lamport Clock COS 418 - Vector Clock Consistency Models Consistency

Distributed Systems for Fun and Profit

Distributed Systems for Fun and Profit 是本分散式系統的書,短小精悍目標是涵蓋所有分散式系統的概念和點出一些關鍵的演算法,然後這是我的筆記。 Basics 基本電腦系統可以分為兩種 task 需要去完成 storage computation 分散式系統的目標是要解決當我們系統 scale up 去處理這些 task ,並且而後發生的種種 trade-off。 Scalability is the ability of a system, network, or process, to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth Growth 可以從很多面向來看,但最重要相關的觀點用以來量測就是performance and availability Performance (and latency) is characterized by the amount of useful work accomplished by a computer system compared to the time and resources used