May 30, 2011

Two readers (HWM and Amateur Professional) have rightly called me out: my recent article, “Ten things every programmer must know,” provides only a list of items to study without conveying why those items matter. They make a very valid point, and the purpose of this post is to provide that motivation. The short answer is that the importance of efficient code has grown in recent years, and it is hard to write efficient code without understanding hardware.

Update 6/19/2011: I have written a computer science self-assessment for developers to test their knowledge about computer science.

Continue reading “Quick Post: Why do programmers need to understand hardware?” »

May 30, 2011

In my computer architecture course, I was told that a processor is like a car: the steering wheel and pedals are the ISA, the engine is the microarchitecture, and the software is the driver. I will extend that analogy further and say that using a compiler is like operating the car with a remote control. Remote controls are great, but it is still important to understand their inner workings. I see a lot of code, even in professional software, that can confuse the smartest compilers. In this post I cover three common tricks that confuse a compiler.
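One classic way to confuse a compiler is possible pointer aliasing. A minimal sketch (the function names are mine, not from the post): when the loop bound is passed by pointer, the compiler cannot prove that writes through `dst` leave it unchanged, so it must reload it on every iteration.

```c
/* Because dst and len may alias, the compiler must reload *len on
   every iteration instead of keeping it in a register. */
void scale_may_alias(int *dst, const int *src, int *len) {
    for (int i = 0; i < *len; i++)   /* *len re-read each pass */
        dst[i] = src[i] * 2;
}

/* Passing the length by value removes the ambiguity, letting the
   compiler hoist the bound into a register. */
void scale_no_alias(int *dst, const int *src, int len) {
    for (int i = 0; i < len; i++)
        dst[i] = src[i] * 2;
}
```

Both versions compute the same result; the difference shows up only in the generated code.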

Continue reading “How to trick C/C++ compilers into generating terrible code?” »

May 29, 2011

To maximize concurrency, all threads should be programmed to complete their work at the same time. Balancing the load among threads requires the programmer to predict the latency of each task, which is often impossible due to unpredictable OS/hardware effects. Consequently, programmers split the work into small tasks and use work-queues to distribute the work dynamically. Work-queues exist in many programming paradigms, such as Grand Central Dispatch, Intel TBB, Cilk, and Open MP. While work-queues improve load-balancing, they introduce the overhead of adding and removing tasks to and from the queue. Thus, if each individual task is too small, the work-queue overhead becomes prohibitive, and if it is too large, there is a risk of load imbalance. This post (1) analyzes these trade-offs, (2) provides a method for choosing the best task size at run-time, and (3) explains some recent advances in work-queues that minimize their overhead.
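In Open MP this trade-off is exposed directly as the chunk size of a dynamically scheduled loop. A sketch (the kernel and names are mine): each idle thread grabs the next `chunk` iterations from a shared queue, so a larger chunk amortizes queue overhead while a smaller one balances load.

```c
/* Sum an array with dynamically scheduled chunks. The right chunk
   size is workload-dependent, which is why choosing it at run-time
   can pay off. */
double sum_dynamic(const double *a, int n, int chunk) {
    double total = 0.0;
    /* schedule(dynamic, chunk) is Open MP's built-in work-queue:
       threads pull `chunk` iterations at a time until none remain. */
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

The result is the same for any chunk size; only the queue overhead and load balance change.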

Continue reading “Parallel programming: How to choose the best task-size?” »

May 28, 2011

In response to my post on parallel programming, a slashdot reader wrote: “if hardware is to give us small cores, then they ought to provide bigger caches.” This astute comment made me realize that we must discuss what large/small cores really mean to both hardware and software. In today’s post, I discuss this issue from a single-thread perspective.

Most early processors were in-order cores. That changed in the mid-1990s with the introduction of the Alpha 21264 and the Intel P6 (a.k.a. Pentium Pro), and out-of-order cores have dominated general-purpose computing ever since. Recently, however, the industry has seen a strong push to bring back in-order cores. For example, IBM's Power 6 went the in-order route after a decade of out-of-order designs, Intel is promoting in-order cores in the Atom and Knights families, and almost all ARM processors are in-order. Given this transition, I feel the need to raise awareness of the differences between the two core types and how they impact software.

Continue reading “Back to One-Lane Roads: Programming the In-Order Cores” »

May 26, 2011

This post is a follow-up to my post titled “Why parallel programming is hard.” To demonstrate parallel programming, I present a case study of parallelizing a kernel that computes a histogram. I use Open MP for parallelization (see the reason here). The post first introduces some basic parallel programming concepts and then dives deep into performance optimizations.
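To give a flavor of the optimizations discussed, here is a sketch of the usual first step when parallelizing a histogram (the bin count and names are mine, not necessarily those in the post): give every thread a private histogram so threads do not contend on shared bins, then merge once at the end.

```c
#include <string.h>

#define NBINS 256

/* Histogram of byte-valued data with per-thread private bins. */
void histogram(const unsigned char *data, int n, int *bins) {
    memset(bins, 0, NBINS * sizeof(int));
    #pragma omp parallel
    {
        int local[NBINS] = {0};      /* private bins: no sharing */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            local[data[i]]++;
        #pragma omp critical         /* merge once per thread */
        for (int b = 0; b < NBINS; b++)
            bins[b] += local[b];
    }
}
```

The naive alternative, incrementing a shared `bins[]` inside the loop, either races or serializes on atomics, which is exactly the kind of trade-off the full post walks through.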

Update 5/30/2011: I have written my next post on parallel programming. It discusses the motivation, implementation, and trade-offs of dynamic work scheduling.

Continue reading “Writing and Optimizing Parallel Programs — A complete example” »

May 25, 2011

I confess: I have an ulterior motive behind this post. Eventually I want to write a parallel programming tutorial that demonstrates the performance trade-offs in parallel programs. Since the focus of that tutorial is on performance, I prefer the parallel programming framework with the fewest syntactic distractions. I think I will choose Open MP because it seems to be the cleanest alternative for parallelizing regular for-loops. This post familiarizes readers with Open MP before the tutorial, so that the tutorial can focus solely on performance optimizations.
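To show what I mean by “least syntactic distractions,” here is a sketch of a regular for-loop parallelized with Open MP (the kernel is mine): one pragma, no thread creation, joining, or manual index partitioning, all of which the equivalent pthreads version would need.

```c
/* y[i] += a * x[i], parallelized with a single pragma. Open MP
   splits the iteration space across threads automatically. */
void saxpy(float *y, const float *x, float a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Remove the pragma and the code is still valid serial C, which is precisely why Open MP keeps a performance tutorial readable.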

Continue reading “Open MP vs pthreads” »

May 24, 2011

An hour ago I had a discussion with a colleague about how much slower Python is than C. His guess was 10x; I expected less than 10x. It turns out that Python is more than 10x slower. Before doing my own comparison, I googled for existing ones, but most included I/O operations, which muddies the comparison: I/O takes the same amount of time in both languages, which indirectly favors the slower language. I wanted my analysis to be purely compute-bound, so I coded my favorite compute-bound kernel, a histogram, in both C and Python and tested it on an Intel Core2Quad system with 4GB of memory. The code and results follow.
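The methodology matters more than the kernel: time only the compute, never the I/O. A sketch of the timing harness I have in mind (the kernel here is a stand-in, not the post's actual histogram, which is behind the link):

```c
#include <time.h>

/* A purely compute-bound stand-in kernel: no memory traffic beyond
   the loop variable, no I/O inside the timed region. */
static long kernel(int n) {
    long acc = 0;
    for (int i = 0; i < n; i++)
        acc += i & 0xFF;
    return acc;
}

/* Time the kernel alone; printing the result happens outside. */
double time_kernel(int n) {
    clock_t t0 = clock();
    long r = kernel(n);
    clock_t t1 = clock();
    (void)r;                         /* keep the call from being elided */
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

The same structure applies on the Python side: wrap only the loop, so the measured ratio reflects compute speed rather than shared I/O cost.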

Continue reading “Quick post: Python vs C in compute-bound workloads” »

May 24, 2011

Dear readers, I was recently introduced to XMOS and the concept of open hardware, thanks to Jonathan May. I got curious about XMOS when Jonathan responded to one of my posts with “for proper multicore, look at @xmos.” I do feel it is a concept we should all be aware of.

At my request, Jonathan has written a wonderful introduction to XMOS, specifically targeted at our audience with its combined hardware/software focus. I hope you will enjoy this post. You can follow Jonathan at @jonathan_may.

Continue reading “An introduction to XMOS — an open hardware platform” »

May 22, 2011

Dear readers, I would like to welcome Tauseef Rab as a guest blogger. Tauseef has extensive experience in logic design and circuit design, having worked at companies such as Freescale, SigmaTel, Marvell, and Qualcomm. While my post covered topics from several fields, I agree with Tauseef that most interviews are targeted at a specific topic. At my request, Tauseef has agreed to share a list of questions he finds most relevant for logic design. Tauseef, I look forward to your future contributions as well.

I think you should split your post [Ten fun hardware design questions] into two very relevant, yet unique topics: circuit design (transistor level) and logic design.

Following are the Top 7 Logic Design review topics in my mind (I will share the circuit design review sheet later).

Continue reading “Logic Design Review Sheet (by Tauseef)” »

May 21, 2011

I wrote a list of ten things programmers must know and felt it was unfair to leave out the hardware folks. Since I have already covered what hardware designers need to know about software (here, here, and here), I am now posting some important review topics for hardware designers. Make sure you revise these topics before job interviews. While this list targets architects and VLSI/RTL designers, it is a good resource for anyone looking to brush up on hardware design. It can also be used as a study guide for learning hardware on your own. I purposely use keywords that can be googled.

Continue reading “Ten fun hardware design questions” »