Tuesday 21 October 2014

MySQL Community vs Enterprise vs Standard

In my roughly two years of work, across many different projects, the database I have used most often is MySQL, because it is popular and easy to use.

A question I kept asking myself is: why does MySQL come in different editions (Community - Standard - Enterprise)? Has anyone else ever wondered the same thing?

Thanks to Lynn Ferrante, whose article finally gave me the answer.

Community vs Enterprise

The community and enterprise editions are quite similar; the difference is that the enterprise edition adds a few features:
- Support: support for mission-critical issues and hot fixes.
- Backup: MEB (MySQL Enterprise Backup), online InnoDB and MyISAM backups.
- Monitoring Tools: MEM (MySQL Enterprise Monitor), Query Analyzer, web-based configuration.
- Plugins: additional plugins for performance improvement, e.g. thread pool, PAM authentication, etc.

The most important enterprise feature is the thread pool, used to optimize database performance. So what is a thread pool, and why does it improve performance?

Here is the official answer from MySQL (very thorough and detailed):
In summary:
The MySQL Thread Pool is a plugin that improves the connection handling of the MySQL server. It limits the number of concurrently executing statements/queries and transactions to ensure that each connection gets the CPU and memory it needs to do its work.
Normally MySQL uses a one-thread-per-connection model (each incoming connection is handled by its own thread). As a result, the more clients that connect to the server and execute statements, the worse overall performance gets.
The Thread Pool replaces that model in order to reduce overhead and increase performance; it handles many threads for a large number of connections efficiently, especially on modern multi-CPU/multi-core systems.
The Thread Pool uses a divide-and-conquer approach to limit and balance concurrency. It decouples connections from threads, so there is no fixed relationship between a worker thread and an incoming connection. It manages client connections in thread groups, where statements are prioritized and queued based on the nature of the work they do.

How it works:
Threads are organized into 16 groups by default (this can be raised to 64). Each group starts with one thread and can grow to a maximum of 4096 threads; additional threads are created only when needed.
Each incoming connection is assigned to a group in round-robin fashion. Each group has a listener thread that listens for incoming statements on the group's connections.

When a statement arrives at a group, the group's listener thread executes it immediately if it is not busy and nothing else is waiting. If the statement finishes quickly, the listener thread goes back to waiting for the next statement (which prevents an extra thread from being created). If the statement does not finish quickly (it keeps executing), a new thread is created to become the new listener.

If the listener thread is busy, the incoming statement is queued. There is a short wait (thread_pool_stall_limit, 60ms by default) to see whether the currently executing statement finishes quickly. If it does (in less than thread_pool_stall_limit), the same thread can be reused for the next statement in the queue, which removes the overhead of creating a new thread when there are many short statements to execute in parallel.


The thread pool is designed to have one statement executing per group at any point in time. The number of groups (the thread_pool_size variable) is therefore crucial, because it approximates the number of short-running statements that can execute concurrently at any moment. Long-running statements are prevented from making other statements wait: once they exceed thread_pool_stall_limit, another thread is created and the next statement in the queue runs on it.

There are two queues for waiting statements, low priority and high priority.
The low-priority queue contains:
- all statements for non-transactional storage engines (MyISAM, ISAM)
- all statements when autocommit is enabled
- the first statement of an InnoDB transaction
Of course these statements do not sit there forever: they are kicked up to the high-priority queue once thread_pool_prio_kickup_timer expires. However, only a limited number of statements is moved per unit of time (100 statements/second), which keeps everything under control.

The high-priority queue contains:
- any subsequent statement in an InnoDB transaction
- any statement kicked up from the low-priority queue
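
To make these knobs concrete, here is a sketch of what the relevant my.cnf settings could look like. The variable names (thread_pool_size, thread_pool_stall_limit, thread_pool_prio_kickup_timer) come from the MySQL Enterprise thread pool documentation; the values and plugin file name are illustrative assumptions, not tuning advice:

[mysqld]
# load the (Enterprise-only) thread pool plugin
plugin-load-add=thread_pool.so
# number of thread groups, roughly how many statements run at once
thread_pool_size=16
# how long a statement may run before its group starts another thread
# (expressed in units of 10ms, so 6 means 60ms)
thread_pool_stall_limit=6
# milliseconds before a waiting low-priority statement is kicked up
# to the high-priority queue
thread_pool_prio_kickup_timer=1000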

Standard vs Enterprise
You can find more information here to see where they differ ;)

Some information links:
- http://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_thread_pool_algorithm
- http://dev.mysql.com/doc/refman/5.6/en/thread-pool-tuning.html

Tuesday 7 October 2014

Bandwidth vs Throughput

These two terms come up a lot (and trip people up) in interviews; I hope this post helps you understand what they really mean.

- Bandwidth: how fast a device can push data onto a link. A classic example is a highway with 4 lanes running in the same direction. All 4 lanes are full of cars moving at top speed; in this picture the highway is the bandwidth.

- Throughput: how many bits actually get transferred between two machines, or the average rate of messages successfully delivered over a channel. The data can travel over a physical or a logical link. Using the same car example: if one car represents a 64KB packet, then a round trip of that car over the same route per unit of time (usually milliseconds) is the throughput.

To make this clearer, here is an example using the actual formula.

Assume that:

- All links between the ISPs are 45Mbps
- All devices have a 64KB TCP window (65535 bytes)
- The test machines are connected directly to their regional ISP
- Only the test machines' traffic is on the links
- There is no congestion between the nodes
- There is no packet loss between the nodes
- The RTT (round-trip time) between the server (New York) and the client in Chicago is 30ms
- The RTT (round-trip time) between the server (New York) and the client in Japan is 200ms

With that topology (a New York server talking to clients in Chicago and Japan over 45Mbps ISP links), the upload/download throughput between the nodes is calculated as follows:
TCP throughput NY to Chicago = (64000 bytes x 8) bits / 0.03 second = 17,066,666 bps ≈ 17Mbps
TCP throughput NY to Japan = (64000 bytes x 8) bits / 0.2 second = 2,560,000 bps = 2.56Mbps

That is the theoretical TCP throughput under the assumed conditions: neither the client in Chicago nor the one in Japan will ever reach 45Mbps. They only use about 37% and 5.6% of the bandwidth, respectively.

So in the first case (37%) only about two to three clients (45/17 ≈ 2.6) can transfer data at the same time before the link is over-utilized, while in the second case (5.6%) it takes about 17.5 clients.
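
If you prefer to see the arithmetic as code, here is a tiny Go sketch of the same window-over-RTT calculation (the window size and RTT values are the assumptions above, not measurements):

package main

import "fmt"

// Theoretical single-stream TCP throughput is bounded by window size / RTT.
func tcpThroughputBps(windowBytes int, rttSeconds float64) float64 {
	return float64(windowBytes) * 8 / rttSeconds
}

func main() {
	window := 64000 // ~64KB TCP window

	fmt.Printf("NY -> Chicago: %.2f Mbps\n", tcpThroughputBps(window, 0.030)/1e6) // RTT 30ms
	fmt.Printf("NY -> Japan:   %.2f Mbps\n", tcpThroughputBps(window, 0.200)/1e6) // RTT 200ms
}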

At this point some of you may be wondering: why divide the TCP window by the RTT? I will write a follow-up post explaining the TCP window and why it matters so much.

Friday 3 October 2014

[GoLang] Warm up for newbies

I have been idle for a few months now, and playing Dota all day got boring, so I decided to get back into writing code. I have been reading about Go for almost a month and still have not built anything. Sad :D

With determination running high, today I am writing an introduction to GoLang.

Go go go go!

Why Go? Honestly, I did not care much about what it can do (performance, concurrency, garbage collection …). My first impression was that the syntax looks a lot like Python (liked it already - easy to read); after reading the documentation for a while it also felt a lot like C (liked it even more, even though my C is terrible); and the final reason is that it comes from Google (so it is probably solid). OK, let's go.

I am using Ubuntu 14.04 LTS as my OS and liteide as my IDE.

Installing GoLang and the IDE

Honestly, the official instructions are already very detailed; anyone who knows a little English can follow the installation guide here.
Go to this page and download the latest Go release that matches your OS.

duym@local:~$ wget https://storage.googleapis.com/golang/go1.3.3.linux-amd64.tar.gz
duym@local:~$ tar xvfz go1.3.3.linux-amd64.tar.gz

Leave that for now; next we will install liteide.
Go here, download the latest release, and extract it.

duym@local:~$ tar xvfj liteidex23.2.linux-64.tar.bz2

Done: we have a compiler and an IDE. Now let's configure a few things so it looks professional :)
Create a workspace to hold the projects we will work on later.

duym@local:~$ mkdir go_repo
duym@local:~$ cd go_repo
duym@local:~/go_repo$ mkdir src pkg bin

We create three directories: src, pkg, bin. Why?
+ src: holds the source files (.go, .c, .h, .s) of the projects we will work on
+ pkg: holds the compiled .a package files (think of them as libraries, ok?)
+ bin: holds the executables

Set up the environment variables by editing $HOME/.bashrc (I am on Ubuntu; if you use another OS, dig around for the equivalent yourself, and email me if you cannot find it).

export GOPATH=$HOME/go_repo
export GOBIN=$GOPATH/bin
export GOROOT=$HOME/go
export PATH=$PATH:$GOROOT/bin:$HOME/liteide/bin

Save the file and reload the environment:

duym@local:~$ . .bashrc

Now check it:
duym@local:~/$ go env
GOARCH="amd64"
GOBIN="/home/duym/go_repo/bin"
GOCHAR="6"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/duym/go_repo"
GORACE=""
GOROOT="/home/duym/go"
GOTOOLDIR="/home/duym/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="g++"
CGO_ENABLED="1"

Good, but let's pause and explain the environment variables above:
- GOPATH: the workspace we just created (containing the src, bin, pkg directories)
- GOBIN: where the executables go after building from src and pkg
- GOROOT: the Go installation directory

OK, now start liteide and write "hello world" :D

duym@local:~$ liteide &

Sorry all, at this point I should have added screenshots to make things clearer, but I am too lazy :D so instead I will post a clip showing how to run hello world at this link.
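
In the meantime, here is a minimal hello world you can paste straight into liteide, or into a file such as $GOPATH/src/hello/hello.go (a path I am making up to match the workspace above):

package main

import "fmt"

func main() {
	fmt.Println("Hello, world!")
}

From inside that directory, "go run hello.go" (or liteide's Build and Run buttons) should print "Hello, world!".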

Thursday 2 October 2014

Using Two Factor Google Authenticator

Gmail and Hotmail both support two-step verification for logging in these days.
Today I will show you how to use Google Authenticator to add two-step verification to a Linux desktop login or an SSH login.
What is two-step verification? It is confirming the user's identity through two steps:
 + step 1: a password by default (this can be swapped for a key)
 + step 2: a numeric code generated by the Google Authenticator app, using the TOTP algorithm (Time-based One-Time Password); see the sketch right below for how such codes are derived
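
To demystify what the app actually does, here is a rough Go sketch of the TOTP algorithm (RFC 6238: HMAC-SHA1 over a 30-second time counter, with the dynamic truncation from RFC 4226). The secret in main() is a throwaway demo value, not one produced by google-authenticator:

package main

import (
	"crypto/hmac"
	"crypto/sha1"
	"encoding/base32"
	"encoding/binary"
	"fmt"
	"time"
)

// totp derives a 6-digit code from a base32 secret for the 30-second time
// step containing t, which is what the Google Authenticator app computes.
func totp(secret string, t time.Time) (string, error) {
	key, err := base32.StdEncoding.DecodeString(secret)
	if err != nil {
		return "", err
	}
	var msg [8]byte
	binary.BigEndian.PutUint64(msg[:], uint64(t.Unix()/30)) // 30-second step counter

	mac := hmac.New(sha1.New, key)
	mac.Write(msg[:])
	sum := mac.Sum(nil)

	offset := sum[len(sum)-1] & 0x0f // dynamic truncation
	code := binary.BigEndian.Uint32(sum[offset:offset+4]) & 0x7fffffff
	return fmt.Sprintf("%06d", code%1000000), nil
}

func main() {
	// Throwaway demo secret; the real one ends up in ~/.google_authenticator.
	code, err := totp("JBSWY3DPEHPK3PXP", time.Now())
	if err != nil {
		panic(err)
	}
	fmt.Println("current code:", code)
}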

The first requirement is a smartphone :) Instructions for installing Google Authenticator on your phone: Install Google Authenticator

1. Installation
CentOS
# sudo yum install google-authenticator

Ubuntu
# sudo apt-get install libpam-google-authenticator

2. Configuration

# google-authenticator

The console will display a QR code; scan it with the phone where you installed the Authenticator app earlier. (OK, now we have the code generator for the second verification step.)

Along with the QR code, the following questions appear so we can tighten security a bit more. I suggest answering y (yes) and Enter to all of them :D
Do you want me to update your "/home/duym/.google_authenticator" file (y/n) y
=> Where the secret key and configuration options are stored.
Do you want to disallow multiple uses of the same authentication token? This restricts you to one login about every 30s, but it increases your chances to notice or even prevent man-in-the-middle attacks (y/n) y
=> Whether to disallow using the same authentication token more than once.
By default, tokens are good for 30 seconds and in order to compensate for possible time-skew between the client and the server, we allow an extra token before and after the current time. If you experience problems with poor time synchronization, you can increase the window from its default size of 1:30min to about 4min. Do you want to do so (y/n) y
=> Whether to widen the window during which a token is accepted. The default is 1m30s; answering yes increases it to about 4 minutes.
If the computer that you are logging into isn't hardened against brute-force login attempts, you can enable rate-limiting for the authentication module. By default, this limits attackers to no more than 3 login attempts every 30s. Do you want to enable rate-limiting (y/n) y
=> Whether to enable rate limiting of login attempts, here 3 attempts per 30s.

And that's it. Now let's configure the server/desktop to actually use this feature.

To enable it for sshd:
# echo "auth required pam_google_authenticator.so" | sudo tee -a /etc/pam.d/sshd
Edit /etc/ssh/sshd_config and set:
ChallengeResponseAuthentication yes
Restart the ssh service:
# sudo service ssh restart

To enable it for desktop logins:
Ubuntu:
# echo "auth required pam_google_authenticator.so" | sudo tee -a /etc/pam.d/lightdm
# echo "auth required pam_google_authenticator.so" | sudo tee -a /etc/pam.d/gnome-screensaver

Fedora:
# echo "auth required pam_google_authenticator.so" | sudo tee -a /etc/pam.d/gdm-password

OK, done! Now let's test it:
# ssh localhost
Password:
Verification code:
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-37-generic x86_64)

Have fun.



Thursday 25 September 2014

Thinking is a drug


Thanks to Guz Alexander, I finally figured out why I'm weird! Great.
"When someone mentions addiction there is always something negative about it. But I think not every kind of addiction deserve to be censured, moreover possessing some particular psychological addiction is almost necessary for a programmer. I'm talking about thinking.
Yes, it may sound strange because thinking is the same part of out lives as, for example, breathing or eating. Though pay attention to the fact that almost in every programmer's vacancy there is an item in requirements list "desire to learn new things, constant developing", but no one mentions that you have to like to breathe or to eat. Or did I miss something?
This point (about desire) to my opinion is not necessary at all. Why not? Try to ask any programmer why he became "this weird silent guy". May be he will tell you something like "I want a lot of money and a cool car", but this is not surprising at all - we're already considered to be a little strange, no need to make it worse and tell the truth. And the truth is simple: permanent process of thinking, solving difficult tasks, learning new things is the best remedy for boredom. We are addicted to thinking. And I like it."

Thursday 18 September 2014

XOR

Today I got a question about RAID 4 from a friend. The interesting part is how to recover the data from one dead disk.
From here => http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_4
I figured out how it works.

For example, suppose the parity disk stores:
A xor B xor C xor D = E
Problem: A is lost; how do we recover A?
Solution: XOR has these properties:
A xor A = 0
X xor 0 = X
X xor 1 = not X (flips the bit)
so, XOR-ing both sides with B, C and D:

A xor B xor C xor D xor B xor C xor D = E xor B xor C xor D
<=> A = E xor B xor C xor D

and we get A back.
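
Here is the same recovery written as a small Go program (a toy sketch with four made-up 4-byte "disks", not real RAID code):

package main

import "fmt"

// xorBlocks XORs equally sized byte slices together. With a parity block built
// this way (E = A xor B xor C xor D), any single lost block can be rebuilt
// from the parity block and the surviving blocks.
func xorBlocks(blocks ...[]byte) []byte {
	out := make([]byte, len(blocks[0]))
	for _, b := range blocks {
		for i := range out {
			out[i] ^= b[i]
		}
	}
	return out
}

func main() {
	a := []byte{0x41, 0x42, 0x43, 0x44} // four pretend data disks
	b := []byte{0x10, 0x20, 0x30, 0x40}
	c := []byte{0xde, 0xad, 0xbe, 0xef}
	d := []byte{0x01, 0x02, 0x03, 0x04}

	e := xorBlocks(a, b, c, d) // parity disk: E = A xor B xor C xor D

	recovered := xorBlocks(e, b, c, d) // disk A died: A = E xor B xor C xor D
	fmt.Printf("original A:  % x\n", a)
	fmt.Printf("recovered A: % x\n", recovered)
}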


Wednesday 17 September 2014

Understand SELinux the easy way

Thanks to Daniel J Walsh, I got a good overview of SELinux: http://opensource.com/business/13/11/selinux-policy-guide


Thursday 28 August 2014

Why is using service better than /etc/init.d?

As a sysadmin, I wondered about service vs /etc/init.d, since both are used for controlling processes. Thanks to the CDH docs (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Quick-Start/cdh5qs_prereq.html) I found the answer:
"This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the /etc/init.d scripts directly, any environment variables you have set remain in force, and could produce unpredictable results."

Monday 23 June 2014

Find all subdomain of domain name

1. Use Wolfram Alpha

2. Use Google
http://www.google.com/search?num=100&q=site:*.domain.name/

3. Use pentest-tool
https://pentest-tools.com/reconnaissance/find-subdomains-of-domain

Thursday 12 June 2014

Object storage and Block Storage

Chapter 6. Storage Decisions

Storage is found in many parts of the OpenStack stack, and the differing types can cause confusion to even experienced cloud engineers. This section focuses on persistent storage options you can configure with your cloud. It's important to understand the distinction between ephemeral storage and persistent storage.

Ephemeral Storage

If you deploy only the OpenStack Compute Service (nova), your users do not have access to any form of persistent storage by default. The disks associated with VMs are "ephemeral," meaning that (from the user's point of view) they effectively disappear when a virtual machine is terminated.

Persistent Storage

Persistent storage means that the storage resource outlives any other resource and is always available, regardless of the state of a running instance.
Today, OpenStack clouds explicitly support two types of persistent storage: object storage and block storage.

Object Storage

With object storage, users access binary objects through a REST API. You may be familiar with Amazon S3, which is a well-known example of an object storage system. Object storage is implemented in OpenStack by the OpenStack Object Storage (swift) project. If your intended users need to archive or manage large datasets, you want to provide them with object storage. In addition, OpenStack can store your virtual machine (VM) images inside of an object storage system, as an alternative to storing the images on a file system.
OpenStack Object Storage provides a highly scalable, highly available storage solution by relaxing some of the constraints of traditional file systems. In designing and procuring for such a cluster, it is important to understand some key concepts about its operation. Essentially, this type of storage is built on the idea that all storage hardware fails, at every level, at some point. Infrequently encountered failures that would hamstring other storage systems, such as issues taking down RAID cards or entire servers, are handled gracefully with OpenStack Object Storage.
A good document describing the Object Storage architecture is found within the developer documentation—read this first. Once you understand the architecture, you should know what a proxy server does and how zones work. However, some important points are often missed at first glance.
When designing your cluster, you must consider durability and availability. Understand that the predominant source of these is the spread and placement of your data, rather than the reliability of the hardware. Consider the default value of the number of replicas, which is three. This means that before an object is marked as having been written, at least two copies exist—in case a single server fails to write, the third copy may or may not yet exist when the write operation initially returns. Altering this number increases the robustness of your data, but reduces the amount of storage you have available. Next, look at the placement of your servers. Consider spreading them widely throughout your data center's network and power-failure zones. Is a zone a rack, a server, or a disk?
Object Storage's network patterns might seem unfamiliar at first. Consider these main traffic flows:
  • Among object, container, and account servers
  • Between those servers and the proxies
  • Between the proxies and your users
Object Storage is very "chatty" among servers hosting data—even a small cluster does megabytes/second of traffic, which is predominantly, “Do you have the object?”/“Yes I have the object!” Of course, if the answer to the aforementioned question is negative or the request times out, replication of the object begins.
Consider the scenario where an entire server fails and 24 TB of data needs to be transferred "immediately" to remain at three copies—this can put significant load on the network.
Another fact that's often forgotten is that when a new file is being uploaded, the proxy server must write out as many streams as there are replicas—giving a multiple of network traffic. For a three-replica cluster, 10 Gbps in means 30 Gbps out. Combining this with the previous high bandwidth demands of replication is what results in the recommendation that your private network be of significantly higher bandwidth than your public need be. Oh, and OpenStack Object Storage communicates internally with unencrypted, unauthenticated rsync for performance—you do want the private network to be private.
The remaining point on bandwidth is the public-facing portion. The swift-proxy service is stateless, which means that you can easily add more and use HTTP load-balancing methods to share bandwidth and availability between them.
More proxies means more bandwidth, if your storage can keep up.

 Block Storage

Block storage (sometimes referred to as volume storage) provides users with access to block-storage devices. Users interact with block storage by attaching volumes to their running VM instances.
These volumes are persistent: they can be detached from one instance and re-attached to another, and the data remains intact. Block storage is implemented in OpenStack by the OpenStack Block Storage (cinder) project, which supports multiple backends in the form of drivers. Your choice of a storage backend must be supported by a Block Storage driver.
Most block storage drivers allow the instance to have direct access to the underlying storage hardware's block device. This helps increase the overall read/write IO.
Experimental support for utilizing files as volumes began in the Folsom release. This initially started as a reference driver for using NFS with cinder. By Grizzly's release, this has expanded into a full NFS driver as well as a GlusterFS driver.
These drivers work a little differently than a traditional "block" storage driver. On an NFS or GlusterFS file system, a single file is created and then mapped as a "virtual" volume into the instance. This mapping/translation is similar to how OpenStack utilizes QEMU's file-based virtual machines stored in /var/lib/nova/instances.

 OpenStack Storage Concepts

Table 6.1, “OpenStack storage” explains the different storage concepts provided by OpenStack.
Table 6.1. OpenStack storage
Ephemeral storage: used to run the operating system and scratch space; accessed through a file system; accessible from within a VM; managed by OpenStack Compute (nova); persists until the VM is terminated; sizing is determined by administrator configuration of size settings, known as flavors; typical usage: 10 GB first disk, 30 GB second disk.

Block storage: used to add additional persistent storage to a virtual machine (VM); accessed through a block device that can be partitioned, formatted, and mounted (such as /dev/vdc); accessible from within a VM; managed by OpenStack Block Storage (cinder); persists until deleted by the user; sizing is determined by the user specification in the initial request; typical usage: 1 TB disk.

Object storage: used to store data, including VM images; accessed through the REST API; accessible from anywhere; managed by OpenStack Object Storage (swift); persists until deleted by the user; sizing is determined by the amount of available physical storage; typical usage: 10s of TBs of dataset storage.

 Choosing Storage Backends

Users will indicate different needs for their cloud use cases. Some may need fast access to many objects that do not change often, or want to set a time-to-live (TTL) value on a file. Others may access only storage that is mounted with the file system itself, but want it to be replicated instantly when starting a new instance. For other systems, ephemeral storage—storage that is released when a VM attached to it is shut down—is the preferred way. When you select storage backends, ask the following questions on behalf of your users:
  • Do my users need block storage?
  • Do my users need object storage?
  • Do I need to support live migration?
  • Should my persistent storage drives be contained in my compute nodes, or should I use external storage?
  • What is the platter count I can achieve? Do more spindles result in better I/O despite network access?
  • Which one results in the best cost-performance scenario I'm aiming for?
  • How do I manage the storage operationally?
  • How redundant and distributed is the storage? What happens if a storage node fails? To what extent can it mitigate my data-loss disaster scenarios?
To deploy your storage by using only commodity hardware, you can use a number of open-source packages, as shown in Table 6.2, “Persistent file-based storage support”.
Table 6.2. Persistent file-based storage support
Swift: object storage.
LVM: block storage.
Ceph: object and block storage; file-level support is experimental.
Gluster: object, block, and file-level storage.
NFS: block and file-level storage.
ZFS: block storage.
Sheepdog: block storage.
This list of open source file-level shared storage solutions is not exhaustive; other open source solutions exist (MooseFS). Your organization may already have deployed a file-level shared storage solution that you can use.
Also, you need to decide whether you want to support object storage in your cloud. The two common use cases for providing object storage in a compute cloud are:
  • To provide users with a persistent storage mechanism
  • As a scalable, reliable data store for virtual machine images

Commodity Storage Backend Technologies

This section provides a high-level overview of the differences among the different commodity storage backend technologies. Depending on your cloud user's needs, you can implement one or many of these technologies in different combinations:
OpenStack Object Storage (swift)
The official OpenStack Object Store implementation. It is a mature technology that has been used for several years in production by Rackspace as the technology behind Rackspace Cloud Files. As it is highly scalable, it is well-suited to managing petabytes of storage. OpenStack Object Storage's advantages are better integration with OpenStack (integrates with OpenStack Identity, works with the OpenStack dashboard interface) and better support for multiple data center deployment through support of asynchronous eventual consistency replication.
Therefore, if you eventually plan on distributing your storage cluster across multiple data centers, if you need unified accounts for your users for both compute and object storage, or if you want to control your object storage with the OpenStack dashboard, you should consider OpenStack Object Storage. More detail can be found about OpenStack Object Storage in the section below.
Ceph
A scalable storage solution that replicates data across commodity storage nodes. Ceph was originally developed by one of the founders of DreamHost and is currently used in production there.
Ceph was designed to expose different types of storage interfaces to the end user: it supports object storage, block storage, and file-system interfaces, although the file-system interface is not yet considered production-ready. Ceph supports the same API as swift for object storage and can be used as a backend for cinder block storage as well as backend storage for glance images. Ceph supports "thin provisioning," implemented using copy-on-write.
This can be useful when booting from volume because a new volume can be provisioned very quickly. Ceph also supports keystone-based authentication (as of version 0.56), so it can be a seamless swap in for the default OpenStack swift implementation.
Ceph's advantages are that it gives the administrator more fine-grained control over data distribution and replication strategies, enables you to consolidate your object and block storage, enables very fast provisioning of boot-from-volume instances using thin provisioning, and supports a distributed file-system interface, though this interface is not yet recommended for use in production deployment by the Ceph project.
If you want to manage your object and block storage within a single system, or if you want to support fast boot-from-volume, you should consider Ceph.
Gluster
A distributed, shared file system. As of Gluster version 3.3, you can use Gluster to consolidate your object storage and file storage into one unified file and object storage solution, which is called Gluster For OpenStack (GFO). GFO uses a customized version of swift that enables Gluster to be used as the backend storage.
The main reason to use GFO rather than regular swift is if you also want to support a distributed file system, either to support shared storage live migration or to provide it as a separate service to your end users. If you want to manage your object and file storage within a single system, you should consider GFO.
LVM
The Logical Volume Manager is a Linux-based system that provides an abstraction layer on top of physical disks to expose logical volumes to the operating system. The LVM backend implements block storage as LVM logical partitions.
On each host that will house block storage, an administrator must initially create a volume group dedicated to Block Storage volumes. Blocks are created from LVM logical volumes.
Note:
LVM does not provide any replication. Typically, administrators configure RAID on nodes that use LVM as block storage to protect against failures of individual hard drives. However, RAID does not protect against a failure of the entire host.
ZFS
The Solaris iSCSI driver for OpenStack Block Storage implements blocks as ZFS entities. ZFS is a file system that also has the functionality of a volume manager. This is unlike on a Linux system, where there is a separation of volume manager (LVM) and file system (such as, ext3, ext4, xfs, and btrfs). ZFS has a number of advantages over ext4, including improved data-integrity checking.
The ZFS backend for OpenStack Block Storage supports only Solaris-based systems, such as Illumos. While there is a Linux port of ZFS, it is not included in any of the standard Linux distributions, and it has not been tested with OpenStack Block Storage. As with LVM, ZFS does not provide replication across hosts on its own; you need to add a replication solution on top of ZFS if your cloud needs to be able to handle storage-node failures.
We don't recommend ZFS unless you have previous experience with deploying it, since the ZFS backend for Block Storage requires a Solaris-based operating system, and we assume that your experience is primarily with Linux-based systems.

 Conclusion

We hope that you now have some considerations in mind and questions to ask your future cloud users about their storage use cases. As you can see, your storage decisions will also influence your network design for performance and security needs. Continue with us to make more informed decisions about your OpenStack cloud design.

Monday 9 June 2014

What is Direct-IO

A file is simply a collection of data stored on media. When a process wants to access data from a file, the operating system brings the data into main memory, where the process reads it, alters it, and stores it back to the disk. The operating system could read and write data directly to and from the disk for each request, but the response time and throughput would be poor due to slow disk access times. The operating system therefore attempts to minimize the frequency of disk accesses by buffering data in main memory, within a structure called the file buffer cache.
Certain applications derive no benefit from the file buffer cache. Databases normally manage data caching at the application level, so they do not need the file system to implement this service for them. The use of a file buffer cache results in undesirable overhead in such cases, since data is first moved from the disk to the file buffer cache and from there to the application buffer. This "double-copying" of data results in more CPU consumption and adds memory overhead as well.

For applications that wish to bypass this buffering within the file system cache, Direct I/O is provided. When Direct I/O is used for a file, data is transferred directly from the disk to the application buffer, without going through the file buffer cache. Direct I/O can be enabled for a file either by mounting the corresponding file system with the direct I/O option (the option differs for each OS), or by opening the file with the O_DIRECT flag specified in the open() system call. Direct I/O benefits applications by reducing CPU consumption and eliminating the overhead of copying data twice: first between the disk and the file buffer cache, and then from the file buffer cache to the application buffer.
However, there are also a few performance impacts when Direct I/O is used: it bypasses file system read-ahead, so workloads that benefit from read-ahead will be slower.
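
For the curious, here is a minimal Go sketch of opening a file with O_DIRECT on Linux (the path is a placeholder, and error handling is reduced to a panic for brevity):

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// syscall.O_DIRECT is Linux-specific; "/tmp/data.bin" is just a placeholder.
	f, err := os.OpenFile("/tmp/data.bin", os.O_RDONLY|syscall.O_DIRECT, 0)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Reads and writes on this descriptor bypass the file buffer cache, but
	// they must use buffers, offsets and lengths aligned to the device's
	// logical block size (typically 512 bytes or 4 KB), otherwise the kernel
	// returns EINVAL. That is why databases allocate their own aligned buffers
	// and manage caching at the application level, as described above.
	fmt.Println("opened", f.Name(), "with O_DIRECT; the page cache is bypassed")
}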

Hardware RAID, Software RAID, ZFS (collected)

Q: Which is best, hardware RAID, vs Software RAID, vs ZFS ?
A:
 Short answer: If your drives are 500GB or less, pick whichever you like / can afford. If you are using drives larger than 500GB use ZFS. No matter what you decide to use, remember to backup your data!
Still have questions? Read on:
If you have been a system administrator for any reasonable length of time you've had a server with a hardware RAID controller report a failed disk. When replacing the failed disk with a new one, another drive fails during the rebuild, the system crashes and you have no choice but to restore your data from backup. This will happen due to the controller being bad or more likely the silent corruption of data on one of the other disks in the array. Silent corruption of data, colloquially referred to as "bit rot", is more frequent these days as hard drives store larger amounts of data. I have personally seen this happen on several occasions over the last 20 years, most frequently within the last 5, so it is natural to wonder if the pros of hardware RAID really outweigh the cons for a general purpose server or NAS device.
What follows is not meant to be the last word on this subject nor a scholarly article. My goal is merely to help NAS4Free users make a fast, informed decision by collecting the most important information from the last few years in one place. When you are done reading everything here you should know exactly why you have chosen one solution over the others. Different people have different needs and different budgets; there is no single solution which is right for everyone, but it is possible to arrive at some general recommendations that will be best for most people. Those recommendations are the short answer above. If you are looking for “How To” type answers or have specific questions about configuring or managing SoftRAID or ZFS arrays you should look elsewhere in the Wiki.
Pros of Hardware RAID
  • Improves server up-time statistics and availability which might be negatively impacted by common, predictable disk failures when configured and monitored properly.
  • Easy to set up – Some controllers have a menu driven wizard to guide you through building your array or automatically set up right out of the box.
  • Easy to replace disks – If a disk fails just pull it out and replace it with a new one assuming all other hardware is hot-plug compatible.
  • Performance improvements (sometimes) – If you are using an underpowered CPU, a HW RAID card with a decent cache and BBU (write backs enabled of course) will generate significantly better throughput. However this will only be a benefit if you are running tremendously intense workloads (transferring data faster than 1Gbps or hosting apps doing constant reads/write), the typical NAS4Free server is not used in such a way, though it could be.
  • Some controllers have battery backups that help reduce the possibility of data damage/loss during a power outage.
Cons of Hardware RAID
  • Proprietary – Minimal or complete lack of detailed hardware and software specifications/instructions. You are stuck with this vendor.
  • On-Disk meta data can make it near impossible to recover data without a compatible RAID card – If your controller dies you’ll have to find a compatible model to replace it with, your disks won’t be useful without the controller. This is especially bad if working with a discontinued model that has failed after years of operation.
  • Monitoring applications are all over the road – Depending on the vendor and model the ability and interface to monitor the health and performance of your array varies greatly. Often you are tied to specific piece of software that the vendor provides.
  • Additional cost – Hardware RAID cards cost more than standard disk controllers. High end models can be very expensive.
  • Lack of standardization between controllers (configuration, management, etc.) – The keystrokes, nomenclature and software that you use to manage and monitor one card likely won't work on another.
  • Inflexible – Your ability to create, mix and match different types of arrays and disks and perform other maintenance tasks varies tremendously with each card.
  • Hardware RAID is not backup, if you disagree then you need to read the definition again. Even the most expensive Hardware RAID is so much junk without a complete and secure backup of your data which would be necessary following a catastrophic, unpredictable failure.
Pros of Software RAID
  • ZFS and SoftRAID can improve server up-time statistics and availability which might be negatively impacted by common, predictable disk failures when configured and monitored properly.
  • ZFS and similar (like BtrFS) were deliberately developed as software systems and for the most part do not benefit from dedicated hardware controllers.
  • Development and innovation in ZFS, BtrFS and others is active and ongoing.
  • Hardware independent – RAID is implemented in the OS which keeps things consistent regardless of the hardware manufacturer.
  • Easy recovery in case of MB or controller failure - With software RAID you can swap the drives to a different server and read the data. There is no vendor lock-in with a software RAID solution.
  • Standardized RAID configuration (for each OS) – The management toolkit is OS specific rather than specific to each individual type of RAID card.
  • Standardized monitoring (for each OS) – Since the toolkit is consistent you can expect to monitor the health of each of your servers using the same methods.
  • Removing and replacing a failed disk is typically very simple and you may not even have to reboot the server if all your hardware is hot-plug compatible.
  • Good performance – CPUs just keep getting faster and faster. Unless you are performing tremendous amounts of IO the extra cost just doesn’t seem worthwhile. Even then, it takes milliseconds to read/write data, but only nanoseconds to compute parity, so until drives can read/write much faster, the overhead is minimal for a modern processor.
  • Very flexible – Software RAID allows you to reconfigure your arrays in ways that are generally not possible with hardware controllers. Of course, not all configurations are useful, some can be downright dumb.
Cons of Software RAID
  • Need to learn the software RAID tool set for each OS – These tools are often well documented but can be more difficult to learn/understand than their hardware counterparts. Flexibility does have its price.
  • Possibly slower performance than dedicated hardware – A high end dedicated RAID card might match or outperform software RAID if you are willing to spend over $500.00 for it.
  • Development of SoftRAID0 through 6 has pretty much stopped as focus moves to newer technologies like ZFS and BtrFS.
  • Additional load on CPU – RAID operations have to be calculated somewhere and in software this will run on your CPU instead of dedicated hardware. While I have yet to see real-world performance degradation it is possible some servers/applications may be more or less affected. As you add more drives/arrays you should keep in mind the minimum processor/RAM needs for the server as a whole. You need to test your particular use case to ensure everything works well.
  • Additional RAM requirements - particularly if using ZFS and some of its high end features, but not for plain old SoftRAID.
  • Susceptible to power outages - A UPS is strongly recommended when using SoftRAID as data loss or damage could occur if power is lost. ZFS is less susceptible, but I would still recommend a UPS.
  • Disk replacement sometimes requires prep work - You typically should tell the software RAID system to stop using a disk before you yank it out of the system. I’ve seen systems panic when a disk was physically removed before being logically removed. This is of course minimized if all your hardware is hot-plug compatible.
  • ZFS and Software RAID are not backup, if you disagree then you need to read the definition again. Even the largest and cheapest SoftRAID is so much junk without a complete and secure backup of your data which would be necessary following a catastrophic, unpredictable failure.
It will be pretty obvious to most people after weighing the pros and cons above that SoftRAID is the better choice for most NAS4Free users. This is true based on price/performance even before we consider stability and longevity of your data.
Now that we've established Software RAID is the way to go, what do you choose? SoftRAID1, SoftRAID5, 6? Maybe a hybrid SoftRAID 10 or 50? What about ZFS?
As drive capacities increase so does silent data corruption, or "bit rot". You must understand what this is to properly assess the odds of a catastrophic loss of data happening to you. Silent data corruption is not a new concept; it has been discussed for years that RAID5 is pretty useless and RAID6 soon will be, see "Why RAID 5 stops working in 2009", also archived here. Read the article, it's about 2 pages and easy to understand even though you have to use powers of 10 to do the math. The takeaway is: don't waste your money or your time on RAID5 or RAID6 if you really care about your data. Personally I have never trusted my data to either.
Since you now understand data integrity can no longer be improved with RAID5 or 6 you are left with just a few choices.
  1. Hybrid SoftRAID - Not really a good option due to cost and management issues for most people, but still valuable in certain specific scenarios.
  2. SoftRAID1 - I have always used SoftRAID1 with a good backup and it has always been cheap and easy to manage.
  3. ZFS - Has a big advantage over plain old SoftRAID, data integrity is built in from the start while SoftRAID has none. Data integrity is the big advantage of ZFS and to learn more about it you should read A Conversation with Jeff Bonwick and Bill Moore - The future of file systems also archived here.
Now that we are starting to use 3TB drives on a regular basis, bit rot becomes an even bigger issue.
How big? Well, the science is not yet fully settled because testing this stuff is so boring. You should read Keeping Bits Safe: How Hard Can It Be? by David S.H. Rosenthal, published in the monthly magazine of the Association for Computing Machinery (ACM). I will summarize as best I can: a NetApp study encompassing data collected over 41 months from NetApp's filers in the field, covering more than 1.5 million drives, found more than 400,000 silent corruption incidents. A CERN study writing and reading over 88,000TB of data within a 6-month period found 0.00017TB was silently corrupted. On average, it is not unreasonable to expect roughly 3 corrupt files on a 1TB drive according to another CERN study; CERN's data corruption research also archived here. Yes, the numbers appear small, but these studies were conducted in 2007/08, before we had 3TB hard drives. The numbers have not gotten better; they get worse with increased capacities and longer time frames.
Stop and think about it, when you had 400GB of data spread among a bunch of 500GB drives, how likely was it you would lose everything? Now that you have 2.5TB of data spread among a bunch of 3TB drives, how likely is it you could lose everything? Knowing that MTBFs and error rates have not improved, yet capacities have increased 6X or more, you've got to ask yourself one question, “Do I feel lucky?”.
The bigger and older your drives are the more likely you are to suffer a failure or silent corruption. So save your time and money, data integrity is only one of the many benefits of ZFS, choose it from the start and avoid problems later. Don't forget you still need a backup even if you are using ZFS.

Monday 12 May 2014

Persistent Database Connections

Persistent Database Connections

Persistent connections are links that do not close when the execution of your script ends. When a persistent connection is requested, PHP checks if there's already an identical persistent connection (that remained open from earlier) - and if it exists, it uses it. If it does not exist, it creates the link. An 'identical' connection is a connection that was opened to the same host, with the same username and the same password (where applicable).
People who aren't thoroughly familiar with the way web servers work and distribute the load may mistake persistent connections for what they're not. In particular, they do not give you an ability to open 'user sessions' on the same link, they do not give you an ability to build up a transaction efficiently, and they don't do a whole lot of other things. In fact, to be extremely clear about the subject, persistent connections don't give you any functionality that wasn't possible with their non-persistent brothers.
Why?
This has to do with the way web servers work. There are three ways in which your web server can utilize PHP to generate web pages.
The first method is to use PHP as a CGI "wrapper". When run this way, an instance of the PHP interpreter is created and destroyed for every page request (for a PHP page) to your web server. Because it is destroyed after every request, any resources that it acquires (such as a link to an SQL database server) are closed when it is destroyed. In this case, you do not gain anything from trying to use persistent connections -- they simply don't persist.
The second, and most popular, method is to run PHP as a module in a multiprocess web server, which currently only includes Apache. A multiprocess server typically has one process (the parent) which coordinates a set of processes (its children) who actually do the work of serving up web pages. When a request comes in from a client, it is handed off to one of the children that is not already serving another client. This means that when the same client makes a second request to the server, it may be served by a different child process than the first time. When opening a persistent connection, every following page requesting SQL services can reuse the same established connection to the SQL server.
The last method is to use PHP as a plug-in for a multithreaded web server. Currently PHP 4 has support for ISAPI, WSAPI, and NSAPI (on Windows), which all allow PHP to be used as a plug-in on multithreaded servers like Netscape FastTrack (iPlanet), Microsoft's Internet Information Server (IIS), and O'Reilly's WebSite Pro. The behavior is essentially the same as for the multiprocess model described before.
If persistent connections don't have any added functionality, what are they good for?
The answer here is extremely simple -- efficiency. Persistent connections are good if the overhead to create a link to your SQL server is high. Whether or not this overhead is really high depends on many factors. Like, what kind of database it is, whether or not it sits on the same computer on which your web server sits, how loaded the machine the SQL server sits on is and so forth. The bottom line is that if that connection overhead is high, persistent connections help you considerably. They cause the child process to simply connect only once for its entire lifespan, instead of every time it processes a page that requires connecting to the SQL server. This means that for every child that opened a persistent connection will have its own open persistent connection to the server. For example, if you had 20 different child processes that ran a script that made a persistent connection to your SQL server, you'd have 20 different connections to the SQL server, one from each child.
Note, however, that this can have some drawbacks if you are using a database with connection limits that are exceeded by persistent child connections. If your database has a limit of 16 simultaneous connections, and in the course of a busy server session, 17 child threads attempt to connect, one will not be able to. If there are bugs in your scripts which do not allow the connections to shut down (such as infinite loops), the database with only 16 connections may be rapidly swamped. Check your database documentation for information on handling abandoned or idle connections.
Warning
There are a couple of additional caveats to keep in mind when using persistent connections. One is that when using table locking on a persistent connection, if the script for whatever reason cannot release the lock, then subsequent scripts using the same connection will block indefinitely and may require that you either restart the httpd server or the database server. Another is that when using transactions, a transaction block will also carry over to the next script which uses that connection if script execution ends before the transaction block does. In either case, you can use register_shutdown_function() to register a simple cleanup function to unlock your tables or roll back your transactions. Better yet, avoid the problem entirely by not using persistent connections in scripts which use table locks or transactions (you can still use them elsewhere).
An important summary. Persistent connections were designed to have one-to-one mapping to regular connections. That means that you should always be able to replace persistent connections with non-persistent connections, and it won't change the way your script behaves. It may (and probably will) change the efficiency of the script, but not its behavior!
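
As an aside, the same "reuse an open connection instead of reconnecting on every request" idea appears in other stacks too. Here is a rough Go sketch using database/sql, whose built-in pool keeps connections open across queries within a single process; the driver import and the DSN are placeholder assumptions:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // placeholder driver choice
)

func main() {
	// database/sql maintains a pool of open connections that outlive
	// individual queries, scoped to this one process.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	db.SetMaxOpenConns(16) // stay under the database server's connection limit
	db.SetMaxIdleConns(4)  // keep a few warm connections around for reuse

	var now string
	if err := db.QueryRow("SELECT NOW()").Scan(&now); err != nil {
		log.Fatal(err)
	}
	fmt.Println("server time:", now)
}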
(collected)