ffind: a sane replacement for command line file search


I tend to use the UNIX command line A LOT. I find it very comfortable to work in when I am developing, and I follow the “Unix as IDE” way. The command line is really rich, and you could probably learn a new command or parameter each day and still be surprised every day for the rest of your life. But some habits stick and things get done, probably not in the most efficient way.

In my case, it is using the `find` command to search for files. 95% of the time I use it in this form:

find . -name '*some_text*'

Which means ‘find in this directory and all its subdirectories a file that contains some_text in its filename’.

It’s not that bad, but I also use ack a lot, which I think is absolutely awesome. I think it is a must-know for anyone using the Unix command line. It is a replacement for grep as a tool for searching code, and works the following way (again, in 90% of my usage):

ack some_text

Which means ‘search all the files that look like code under this directory and its subdirectories for the text some_text’ (some_text can be a regex, but usually you can ignore that part).

So, after a couple of tests, I decided to write my own ack-inspired find replacement, and called it ffind. I’ve been using it for the last couple of days, and it integrates quite well into my workflow (maybe unsurprisingly, as I wrote it with that in mind).

Basically it does this

ffind some_text

Which means ‘find in this directory and all its subdirectories a file that contains some_text in its filename’ (some_text can be a regex). It also has a couple of interesting characteristics: it ignores hidden directories (starting with a dot) but not hidden files, it skips directories that the user is not allowed to read due to permissions, and by default the output shows the matching text in color.

The other use case is

ffind /dir some_text

Which means ‘find in the directory ‘/dir’ and all the subdirectories a file that contains some_text in its filename’

There are a couple more params, but they are there to deal with special cases.
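The behaviour described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not ffind's actual code: the names `find_files` and `colorize` are hypothetical, and the color is a plain ANSI escape sequence.

```python
import os
import re

RED = "\033[31m"
RESET = "\033[0m"


def find_files(pattern, directory="."):
    """Yield (dirpath, filename) for every filename matching the regex."""
    regex = re.compile(pattern)
    # os.walk ignores directories it cannot list (its onerror defaults to None)
    for dirpath, dirnames, filenames in os.walk(directory):
        # Prune hidden directories in place so the walk never enters them;
        # hidden files are still considered
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for filename in filenames:
            if regex.search(filename):
                yield dirpath, filename


def colorize(filename, pattern):
    """Wrap the first match of the pattern in ANSI red."""
    match = re.search(pattern, filename)
    if not match:
        return filename
    start, end = match.span()
    return filename[:start] + RED + filename[start:end] + RESET + filename[end:]
```

Printing `os.path.join(dirpath, colorize(filename, pattern))` for each result of `find_files(pattern)` gives output roughly like the one described, with the match highlighted only in the filename part.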

It is written in Python, and it is available on GitHub. So, if any of this sounds interesting, go there and feel free to use it! Or change it! Or make suggestions!

ffind in Github


UPDATE: ffind is now available on PyPI.

Magical thinking in Software Development


I guess all of us Python developers have heard this kind of argument from time to time:

Python is slower than C++/Java/C# because it is not compiled.

Other than the usual “blame the others” when working with other companies (usually big corporations that think that using anything except C# or Java is laughable), you can also see a lot of comments on technical blogs or places like Hacker News or Reddit with similar, simplistic arguments. You can recognise them in the usual rants about how technology X is The Worst Thing That Ever Happened™ and Should Never Be Used™.

That’s a form of Software Development Magical Thinking. This can be really harmful for software development, especially when the opposite, positive form is used. Let me define Software Development Magical Thinking in this context:

Software Development Magical Thinking (noun): Assuming that a technology will magically avoid a complex problem just by itself.

Probably that will become clearer after a couple of examples:

Java is a statically typed language and it is safer than dynamically typed languages like Ruby.

We program in C++ so our code is very fast.

MongoDB / NodeJS / Riak is web-scale.

Please note that those are not completely, utterly wrong statements. C++ can be very fast. Statically typed languages can avoid some bugs related to the types of input parameters. But there is no guarantee that creating a system in C++ is going to act like a magic wand against slow code. Or that Erlang will avoid having a single point of failure. And you’ll get just as sick of bugs and security issues in statically typed languages as in dynamically typed ones. *

Those are all complex problems that need careful design, and possibly measurements, to deal with. They need deep analysis of the problem, which is usually more complicated than it looks at first. Or even worse, the problem is not as bad as it looked and the designed system is more complex than it should be, trying to catch a problem that never arises. Previous experience also counts when it comes to avoiding subtle errors.

Let me say it again. There are problems that are HARD. In software systems they are confronted almost daily. And no single thing will make you forget them. Even a very good tool for what you’re doing (like Erlang for concurrency), which usually implies paying a price (in development time, etc.), doesn’t replace vigilance, and issues could eventually appear. Unfortunately, making software is tough.

The problem with Software Development Magical Thinking is that it is very easy and it is also very natural. Seductive. We know that “general Magical Thinking”, applying simple solutions to very complex problems, is quite common. Hey, a lot of the time it even seems to work, because the Feared Problem will only appear after a certain size that is never reached, or after the designer has left the company, leaving a latent problem behind. Most of the time, making a totally informed decision is unrealistic, or simply not possible, and some risks must be taken.

But as software developers we should know that things are not that easy, even if we have to compromise. Each bug that takes time to track down, methodically eliminating causes. Every measurement that makes you wonder what the best metric to reflect a value is. Every time you realise that a back-of-the-envelope calculation shows something that will have an impact on some design aspects. Those are all reminders that should make us think that there are no silver bullets and that we shouldn’t take all those difficult problems lightly.

Make decisions. Design systems. Choose a tool over others. Take risks. But don’t be delusional and careless. Be conscious that software can bite you back. Be vigilant. Be skeptical. Avoid Magical Thinking.

PS: And please, don’t say “Python is slow”. Just don’t. It is not, for most jobs. It is not going to make you win a discussion unless you carefully measure and prove it. And, perhaps most importantly, it raises my urge to kill.

* No, I am not going to comment on the Mythical Web Scale property.

EDIT: Wow, it has been submitted to Hacker News here. Just in case anyone wants to add to the discussion there.

My software gadgets for work (part II)


Continuing the post I wrote the other day about the gadgets I use for work, I am also going to talk a bit about the software I use.

I work programming in Python on OS X, so many of the programs I mention are geared towards that environment. My job is developing code for servers that then run on Linux, so many of the tools can be used on Linux too. I have also recently used HTML, JavaScript and Ruby, and the environment is fully applicable to them.


Talks on PyCon Ireland 2012


Well, as usual, this year’s PyCon Ireland has been amazing. I am always impressed by the high quality of the talks and, in general, by how much the attendees know. It is always a pleasure to share some thoughts about technology with incredibly talented people. Python Ireland is doing a great job.

This year I didn’t give one talk, but TWO! It was very exhausting, but fun. I am posting the slides here, in case someone finds them interesting…

You can also download the source Keynote file, which includes notes.

EDIT: Videos added

Respect your production data


I read this blog post yesterday: I Accidentally Deleted All Our Data by Taylor Fausak. You’ll probably end up with the same expression on your face that I did. And a palm covering it.

One thing up front. It takes GREAT courage and openness to tell this story on your blog. I think it is really a great attitude.

That said, I must say that the whole story is a recipe for disaster. Lots of steps make my spider-sense tingle. Strongly.

Running a script in the Python interactive shell to update your production data, while at a convention, between presentations… Well, it’s not the right moment to do ANYTHING that could change your data. A quick look at a monitoring tool, that’s grand. But anything more complex than that is highly risky. And especially using the interactive shell.

You have to RESPECT your production environment and data. Ideally, every change in production should be automated and tested in advance. That means everything but the most extreme cases, like bugs that are blocking the whole application. Sometimes, in extraordinary cases, it could be necessary to take extraordinary measures. But it should always be treated with the proper caution.

You have to give your full attention each time you change anything in production, and have a clear view in advance of what you are trying to do. Think really carefully about what you are going to do. And double check everything you type. Every step that has not been previously tested on a staging environment is a possible disaster for your application.

Anyway, stories like that remind me how much attention I should put into changing production data, and to keep a healthy fear of what could happen. Treat your production environment with proper R-E-S-P-E-C-T or it can bite. Hard.

Bonus: Really, really, REALLY the best way of testing that something works is SAVING everything AGAIN????????

This code gives me nightmares…


from mongoengine import connect
from models import Family
connect('the-production-database')
for family in Family.objects:
   family.save()

Utopia Kingdoms scaling case. From 4 users to 90k+


I almost forgot to post this presentation I gave at PyCon Ireland 2011 this month. It’s about some scalability problems and solutions from working on the Utopia Kingdoms game.

So, here are the slides

UPDATE: In case anyone is interested, here is the talk, courtesy of PyCon Ireland ;-)

 

Django and Rails and Grails, Oh my!


At PyCon Ireland I gave a talk comparing the Django, Ruby on Rails and Grails frameworks… I just forgot to put a link on this blog!

The presentation can be found at Prezi, and there is even a video, if someone wants to make funny comments on my exotic accent :-P A problem with the projector didn’t allow me to display the slides, so I felt a little weird holding the laptop and pointing at the screen, but the people making the video have done their homework and show the proper slides in place. Nice!

Presentation

Video

The original idea was to show the same simple application (a simple posting service) made with the three frameworks, but not being able to display it on the projector really ruined it. Anyway, the code can be downloaded here, if you want to take a look.

Let me know what you think!

Database madness with mongoengine and SQLAlchemy


Yesterday I gave a presentation at the Python Ireland October meeting about some work we are doing with mongoengine and SQLAlchemy, and how we are managing three databases (MS SQL Server, MySQL and MongoDB) in an online football management game we are working on.

So, here are the slides. Feel free to make comments, ask questions and even criticize them!

You can also download the presentation as a PDF here.

PS: When I talk about a football game, I’m referring to soccer.

Commenting the code


I am always surprised to find comments like that regarding code comments. I can understand someone arguing that writing comments in the code is boring, or that you forget about it, or whatever. But to say that the code shouldn’t be commented at all looks a little dangerous to me.

That doesn’t mean that you have to comment everything. Or that adding a comment is an excuse not to be clear directly in the code, or that the comment should repeat what is in the code. You have to keep a balance, and I agree that it’s difficult and everyone can have their own opinion about when to comment and when not to.

Also, each language has its own “comment flow”, and you’ll definitely make more comments in low-level languages like C than in a higher-level language like Python, as the latter is more descriptive and readable. Ohhh, you have to comment so many things in C if you want to be able to understand what a function does in less than a couple of days… (the declaration of variables, for example) #

As everyone has their own style when it comes to commenting, I’m going to describe some of my personal habits when commenting code, to open the discussion and compare with your opinions (with some example Python code):

    • I put comments summarizing code blocks. That way, when I have to locate a specific section of the code, I can go faster by reading the comments and ignoring the code until I get to the relevant part. I also tend to separate those blocks with blank lines.
# Obtain the list of elements from the DB
.... [some lines of code]

# Filter and aggregate the list to obtain the statistics
...  [some lines of code]

UPDATED: Some clarification here, as I think I probably chose the wrong example. Of course, if a block of code gets longer than a few lines and/or is used in more than one place, it will need a function (and a function should ALWAYS get a docstring/comment/whatever). But sometimes I think that a function is not needed, yet a clarification helps to know quickly what that code is about. The original example will remain to show my disgrace, but maybe this other example is better (I have copy-pasted some code I am working on right now and changed a couple of things).
It’s probably not the cleanest code in the world, and that’s why I have to comment it. Later on, maybe I will refactor it (or not, depending on the time).

               # Some code obtaining elements from a web request ....

                # Delete existing layers and requisites
                update = Update.all().filter(Update.item == update).one()
                UpdateLayer.all().filter(UpdateLayer.update_id == update.item_id).delete()
                ItemRequisite.all().filter(ItemRequisite.item == update).delete()

                # Create the new ones
                for key, value in request.params.items():
                    if key == 'layers':
                        slayer = Layer.all().filter(Layer.layer_number == int(value)).one()
                        new_up_lay = UpdateLayer(update=update, layer=slayer)
                        new_up_lay.save()
                    if key == 'requisites':
                        req = ShopItem.all().filter(ShopItem.internal_name == value).one()
                        new_req = ShopItemRequisite(item=update, requisite=req)
                        new_req.save()
  • I briefly describe every non-trivial operation, especially mathematical properties or “clever tricks”. Optimizations usually need some extra description telling why a particular technique is used (and how it’s used).
# Store found primes to increase performance through memoization
# Also, store first primes
found_primes = [2,3]

def prime(number):
    ''' Find recursively if the number is a prime. Returns True or False'''

    # Check on memoized results
    if number in found_primes:
        return True

    # By definition, 1 is not prime
    if number == 1:
        return False

    # Any even number is not prime (except 2, checked before)
    if number % 2 == 0:
        return False

    # Divide the number by all the lower prime numbers (excluding 2)
    # Use this function recursively
    lower_primes = (i for i in range(3, number, 2) if prime(i))
    if any(p for p in lower_primes if number % p == 0) :
        return False

    # The number is not divisible, it's a prime number
    # Store to memoize
    found_primes.append(number)
    return True

(Dealing with prime numbers is something that deserves lots of comments!) EDIT: As stated by Álvaro, 1 is not prime. Code updated.

  • I put TODOs, caveats and any indication of further work, planned or possible.
# TODO: Change the hardcoded IP with a dynamic import from the config file on production.
...
# TODO: The decision about which one to use is based only on getting the shorter one. Maybe a more complex algorithm has to be implemented?
...
# Careful here! We are assuming that the DB is MySQL. If not, this code will probably not work.
...

UPDATE: That is probably also related to the tools I use. S.Lott talks about Sphinx notations, which is even better. I use Eclipse to develop, which automatically picks up any “TODO” in the code and makes a list of them. I find myself using “ack-grep” for that more and more, curiously…

    • I try to comment structures as soon as they have more than a couple of elements. For example, in Python I make extensive use of lists/dictionaries to initialize static parameters in a table-like format, so I use a comment as a header to describe the elements.
# Init params in format: param_name, value
init_params = (('origin_ip','123.123.123.123'),
               ('destiny_ip','456.456.456.456'),
               ('timeout',5000),
              )
for param_name, value in init_params:
    store_param(param_name, value)
  • The size of the comment is important: it should be short, but clarity comes first. So I try to avoid shortening words or using acronyms (unless widely used). Multiline comments are welcome, but I try to avoid them as much as possible.
  • Finally, when in doubt, comment. If at any point I have the slightest suspicion that I’m going to spend more than 30 seconds understanding a piece of code, I put a comment. I can always remove it later, the next time I read that code and see that it is clear enough (which I do a lot of the time). Both being bad, I prefer one unnecessary comment to one missing necessary comment.
  • I think I tend to comment slightly more than other fellow programmers. That’s just a personal, completely unmeasured impression.

What are your ideas about the use of comments?

UPDATE: Wow, I have a reference on S.Lott’s blog, a REALLY good blog that every developer should follow. That’s an honor, even if he disagrees with me on half the post ;-)

# On one of my first projects in C, we followed a quality standard that required 30% of the code lines (not counting blank ones) to be comments.

ORMs and threads


Do you remember the post from Joel Spolsky about leaky abstractions? It’s the kind of idea that intrigued me the first time I read about it, but after some time I began to see it everywhere. From time to time there are problems in my Python code (as well as in other high-level languages) where I am really glad to have an idea of the underlying low-level C, or I would be struggling with some very weird, confusing problems. I have enough confusing and weird problems of my own without adding more…

One of my recent leaky abstractions came from using an ORM, in particular mongoengine, but I think it will probably happen with every ORM. In a web application I am developing at the moment, we need to launch a thread to perform some operations in a timed manner. A request comes to the server and launches a thread, and then that thread stores its status in the database. Then the user can check the status from the database (and do more operations, like pause, etc., but I will leave that aside). While performing some tests on the application, I wrote the following code:

def testing():
    user = User.objects.get(TEST_USER)
    user.launch_thread()
    time.sleep(TIME)
    assert user.status == END

Inside the thread, the code looks similar to this

def thread(user_id):
    thread_user = User.objects.get(user_id)
    # Do things that take a while, but less than TIME
    thread_user.status = END
    thread_user.save()

Ok, so we’re getting an object from the database, the object launches a thread that changes its state to END and saves it after a while. Well, not really… Obviously it’s not working (or I wouldn’t be writing this). But we all already know that threads are the root of all evil, and always have nasty hidden surprises.

The error I was making was assuming (and that’s the abstraction in my mind) that the ORM maps the database into memory, and that the copy is unique. After all, that’s why you have a database. But it’s not true. What’s happening here is that we are creating two different objects in memory. I have (now) used two different names, user and thread_user. In my code I used the same name (user), which probably adds to the confusion. Each one reflects the status of the database at the moment you read it, but after that the object is not updated with the real information in the DB. So the user object still has the starting status, as we haven’t refreshed it with the new information that another, rogue thread has changed while we naively thought it was under our control.
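The two-copies situation can be reproduced without any database. Here is a minimal sketch, assuming a plain dict stands in for the DB and a hypothetical `FakeUser` class stands in for the ORM document (neither is mongoengine code):

```python
import threading

# A plain dict stands in for the database (assumption for illustration only)
FAKE_DB = {"test_user": {"status": "START"}}


class FakeUser:
    """Hypothetical stand-in for an ORM document: a copy of one DB record."""

    def __init__(self, user_id, status):
        self.user_id = user_id
        self.status = status

    @classmethod
    def get(cls, user_id):
        # Each call builds a new, independent object from the stored record
        return cls(user_id, FAKE_DB[user_id]["status"])

    def save(self):
        FAKE_DB[self.user_id]["status"] = self.status


def worker(user_id):
    thread_user = FakeUser.get(user_id)  # second in-memory copy
    thread_user.status = "END"
    thread_user.save()


user = FakeUser.get("test_user")
thread = threading.Thread(target=worker, args=("test_user",))
thread.start()
thread.join()  # wait for the thread instead of sleeping

stale_status = user.status                       # the first copy never changed
fresh_status = FakeUser.get("test_user").status  # a new read sees the update
print(stale_status, fresh_status)
```

The first object keeps the status it had when it was read, while a fresh read returns what the thread saved.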

Usually, in a web application (at least the ones developed with high-level tools), the typical situation is to receive a request, read the data from the DB using an ORM, change something, and then save. We don’t have rogue threads interrupting those operations, and requests can be processed fast enough. Also, each user’s data is different, so two users probably don’t need to write any related information. BUT another (faster) request could definitely interrupt the process and leave the data incoherent. It’s going to be (extremely) rare in most applications, but in the case of long, threaded operations it could be important to be aware of this and try not to rely on the ORM as a virtual copy of the DB, but to read and write in short operations. Or lock the database.

Just one more thing. It’s possible to use only one object in memory and pass it to the thread, avoiding these problems. But that could generate others, like not storing (and loading) any intermediate steps of the process. So, if the thread is stopped (for example, by a server restart), the process is totally lost. Any operation that takes time to execute should ideally have some “resume” process, which means storing the partial state and reading it back from the DB when resuming. Also, in this particular case, there is more than one thread working on the same process, communicating through the DB.

But wait! There is still a little more unexpected and funny behavior!

To reload the user object, my first idea was to generate a refresh method, this way:

class User(mongoengine.Document):
     ...

     def refresh(self):
          ''' Refresh the object '''
          self = User.objects.get(self.id)

And again, it’s not working… :-(
Again, the problem is an abstraction. self is not always the object, not outside the method. It’s just a label (or pointer, if you know C) to the object. Yes, inside the method the label self now points to the new (and correct) object. BUT the label user is still pointing to the not-updated object we have had since the beginning.

So… no shortcuts, we will have to reload the object after the sleep to check that the object in the DB is behaving properly:

def testing():
    user = User.objects.get(TEST_USER)
    user.launch_thread()
    time.sleep(TIME)
    user = User.objects.get(TEST_USER)
    assert user.status == END
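The `self` rebinding issue is plain Python behaviour and can be seen without any ORM. Below is a minimal sketch with a hypothetical `Record` class. The `working_refresh` variant, which copies attributes into the existing object, is just one possible alternative and not what the code above does:

```python
class Record:
    """Hypothetical stand-in for an ORM document."""

    def __init__(self, status):
        self.status = status

    def broken_refresh(self, fresh):
        # Rebinds the local name `self` only; the caller's object is untouched
        self = fresh

    def working_refresh(self, fresh):
        # Copies the fresh attributes into the existing object instead
        self.__dict__.update(fresh.__dict__)


user = Record("START")
latest = Record("END")  # pretend this was just read from the DB

user.broken_refresh(latest)
after_broken = user.status   # unchanged, `user` still labels the old object

user.working_refresh(latest)
after_working = user.status  # now reflects the fresh data
```

mongoengine documents also provide a `reload()` method that refreshes the object from the database, which may be a simpler option than a hand-written refresh.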