
                                  Wednesday, July 8, 2015

                                  The Discriminatory Dark Side Of Big Data

It has happened again. Researchers have discovered that Google's ad-targeting system is discriminatory: male web users were more likely than female visitors to be shown ads for high-paying executive jobs. The researchers published a paper that was presented at the Privacy Enhancing Technologies Symposium in Philadelphia.

I had blogged about the dark side of Big Data almost two years back. Latanya Sweeney, a Harvard professor, Googled her own name and found next to it an ad for a background check, hinting that she had been arrested. She dug deeper and concluded that so-called black-identifying names were significantly more likely to be targeted with such ads. She documented this in her paper, Discrimination in Online Ad Delivery. Google denied then that AdWords was discriminatory in any way, and Google is denying being discriminatory now.

I want to believe Google. I don't think Google believes it is discriminating. And that's the discriminatory dark side of Big Data. I have no intention of painting a gloomy picture and blaming technology, but I find it scary that technology is changing much faster than the ability of the brightest minds to comprehend its impact.

A combination of massively parallel computing, sophisticated algorithms that leverage this parallelism, and the ability of algorithms to learn and adapt without any manual intervention, almost in real time, is going to cause many more such issues to surface. As a customer, you simply don't know whether the products or services you are or aren't offered at a certain price are based on any discriminatory practices. To complicate this further, in many cases even companies don't know whether the insights they derive from vast amounts of internal as well as external data are discriminatory. This is the dark side of Big Data.

The challenge with Big Data is not Big Data itself but what companies could do with your data, combined with any other data, without your explicit understanding of how the algorithms work. To prevent discriminatory practices, we audit employment practices to ensure equal opportunity and college admissions to ensure a fair process, but I don't see how anyone is going to audit these algorithms and data practices.

Disruptive technology always surfaces socioeconomic issues that either didn't exist before or were not obvious and imminent. Some people get worked up because they don't quite understand how the technology works; I still remember politicians trying to blame Gmail for "reading" emails to show ads. I believe that Big Data is yet another such disruption that is going to cause similar issues, and it is disappointing that not much has changed in the last two years.

It has taken a while for the Internet companies to figure out how to safeguard our personal data, and they are not fully there yet; their ability to control how this data could get used is even more questionable. Let's not forget: data does not discriminate, people do. We should not shy away from these issues but should collaboratively work hard to highlight and amplify what these issues might be and address them, as opposed to blaming the technology as evil.

Photo courtesy: Kurt Bauschardt

                                  Friday, June 15, 2012

                                  Proxies Are As Useful As Real Data

Last year I ran a highly unscientific experiment. Every late Monday afternoon, I would put a DVD in an open mail bin in my office to mail it back to Netflix, and I would also count the total number of Netflix DVDs put in that bin by other people. Over time I observed a continuous, consistent decline in the number of DVDs. I compared my results with the numbers released by Netflix, and they matched. I'm not surprised. Even though this was an unscientific experiment on a very small sample with many uncontrolled variables, it still gave me insight into the overall real data that I otherwise had no access to.

                                  Proxies are as useful as real data.

When Uber decides to launch a service in a new city, or when they are assessing demand in an existing city, they use crime data as a surrogate for neighborhood activity. That measurement is a basic input in calculating demand. In many scenarios and applications, access to the real dataset is either prohibitively expensive or impossible. But a proxy is almost always available, and in many cases it is good enough for making decisions that can eventually be validated against real data. This approach, simple as it is, is ignored by many product managers and designers. Big Data is not necessarily solving the problem of access to the specific dataset you may need to design your product or make decisions, but it is certainly opening up an opportunity that didn't exist before: the ability to analyze proxy data and use algorithms to correlate it with your own domain.
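To make the proxy idea concrete, here is a minimal sketch, with entirely made-up numbers, of calibrating a proxy signal (crime reports, as in the Uber example) against a quantity you can measure in a few places, and then using the proxy alone where the real data is unavailable:

```python
# Sketch: estimating an unmeasurable quantity from a proxy.
# All city names and numbers are hypothetical, for illustration only.

# Cities where we know both the proxy and the real value (for calibration)
calibration = {
    # city: (weekly_crime_reports, observed_weekly_ride_demand)
    "City A": (1200, 48000),
    "City B": (800, 31000),
    "City C": (1500, 61500),
}

# Fit a single scale factor by least squares through the origin:
# demand ~ k * proxy, with k = sum(p * d) / sum(p * p)
num = sum(p * d for p, d in calibration.values())
den = sum(p * p for p, _ in calibration.values())
k = num / den

# Estimate demand in a new city where only the proxy is available
new_city_proxy = 1000
estimated_demand = k * new_city_proxy
print(f"scale factor: {k:.2f}, estimated demand: {estimated_demand:.0f}")
```

The single scale factor is the crudest possible model; the point is that even a crude calibration turns a freely available proxy into an actionable estimate, which real data can later confirm or correct.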

As I have argued before, the data external to an organization is probably far more valuable than the data it holds internally. Until now, organizations barely had the capability to analyze even a subset of all their internal data; they could not even think of doing anything interesting with external data. This is going to change rapidly as more and more organizations dip their toes into Big Data. Don't discriminate against any data source, internal or external.

Probably the most popular proxy is per-capita GDP as a measure of standard of living. The Hemline Index is yet another example: it holds that women's skirts become shorter (higher hemlines) during good economic times and longer during not-so-good ones.

                                  Source: xkcd
A proxy is just the beginning of how you could correlate several data sources. But be careful: as wise statisticians will tell you, correlation doesn't imply causation. One of my personal favorite examples is the correlation between the Yankees winning the World Series and a Democratic president in the Oval Office. Correlation doesn't guarantee causation, but it gives you insight into where to begin, what question to ask next, and which dataset might hold the key to that answer. This iterative approach simply wasn't feasible before; by the time people got an answer to their first question, it was too late to ask the second. The ability to go after any dataset, any time you want, opens up many more opportunities. At the same time, as Big Data tools, computing, and access to external public data sources become commodities, it will come down to human intelligence prioritizing the right questions to ask. As Peter Skomoroch, a principal data scientist at LinkedIn, puts it, "'Algorithmic Intuition' is going to be as important a skill as 'Product Sense' in the next decade."
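The correlation itself is cheap to compute, which is exactly why it is so easy to over-read. The sketch below, using two made-up series that merely happen to trend together, shows how unrelated data can still score a near-perfect Pearson correlation:

```python
# Sketch: a high correlation between two unrelated (made-up) series,
# illustrating why correlation alone doesn't imply causation.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two trending-but-unrelated annual series (hypothetical counts)
series_a = [10, 12, 15, 18, 22, 27]
series_b = [3, 4, 5, 6, 8, 10]

r = pearson(series_a, series_b)
print(f"r = {r:.3f}")  # very high, yet neither series causes the other
```

Any two series that grow over time will correlate this way, which is why a strong r is a prompt for the next question, not an answer in itself.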

                                  Friday, June 1, 2007

                                  Moore's law for software

Software design has a strange relationship with computing resources. If resources are scarce, design is difficult; if they are abundant, utilizing them is a challenge. It is rather odd to ask designers and developers to find ways to consume more resources, but this is true and it is happening.

The immense computing resources now available have opened up a lot of opportunities for designers and developers to build agile, highly interactive web interfaces by tapping into the computing cloud. Effective resource utilization by software lags far behind the fast-growing supply of computing resources. Google has successfully demonstrated the link between a humongous cloud infrastructure and applications that effectively use those resources: Gmail and Google Maps are examples of agile, highly interactive interfaces that consume heavy resources, and Google's MapReduce is an example of effective utilization of computing resources by designing search around heavy parallelization.

One of the challenges designers face these days is constructing an application, from an interaction perspective, so that it can actually use the available resources effectively to provide a better user experience. Traditionally, performance tuning has been about fixing software to perform faster without adding extra computing resources; designers and developers now face the opposite challenge of actually using the resources. Cloud computing is going to become more and more relevant as organizations catch up on Web 2.0 and Enterprise 2.0. Google, Yahoo, Salesforce, and Microsoft are betting on huge infrastructures that can deliver the juice required by their applications. Cloud computing is not just about hardware; it is about the scale of computing and the infrastructure required to get to that scale: physical location, energy and cooling requirements, dark fiber, and so on.
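The MapReduce pattern mentioned above can be sketched on a single machine with a process pool. This is a toy word count, not Google's implementation; real MapReduce distributes both phases across a cluster:

```python
# Sketch: the MapReduce pattern on one machine, illustrating the
# parallelization idea. The map phase runs in parallel worker processes;
# the reduce phase merges the partial results.
from collections import Counter
from multiprocessing import Pool

def map_count(chunk):
    """Map phase: count words in one chunk of text."""
    return Counter(chunk.split())

def word_count(chunks):
    with Pool() as pool:
        partials = pool.map(map_count, chunks)  # map chunks in parallel
    total = Counter()
    for partial in partials:                    # reduce: merge partial counts
        total.update(partial)
    return total

if __name__ == "__main__":
    chunks = ["big data big ideas", "data beats algorithms", "big algorithms"]
    print(word_count(chunks).most_common(3))
```

The design insight is that map tasks share nothing, so adding machines (or cores) scales the map phase almost linearly; only the merge is inherently serial.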

Not every single piece of code in software can be parallelized. Developers hit stretches of serial tasks in the code flow under many dynamic conditions. Semantic search is a classic example of something that is hard to run on parallel computing resources, since certain tasks end up serialized due to the dynamic nature of many semantic search engines and their natural language processing. Cognitive algorithms are not the same as statistical or relevancy algorithms, and they require a radically different design approach to utilize the available resources effectively.
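Although the post doesn't name it, Amdahl's law quantifies exactly this limit: the serial fraction of a program caps the overall speedup no matter how many cores you add. A minimal sketch:

```python
# Sketch: Amdahl's law. The serial fraction of a program bounds the
# speedup achievable from parallel hardware.

def amdahl_speedup(serial_fraction, cores):
    """Overall speedup when (1 - serial_fraction) of the work parallelizes."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# Even with 90% of the work parallelizable, the gains plateau quickly:
for cores in (2, 8, 64, 1024):
    print(cores, round(amdahl_speedup(0.10, cores), 2))
# As cores grow without bound, speedup approaches 1 / 0.10 = 10x
```

This is why serializing even a small portion of a semantic search pipeline can dominate its performance on a large cluster.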

Intel has been pushing the industry to improve performance on its multi-core CPUs. Microsoft recently announced an initiative to redesign the next Windows for multiple cores. The design is not just about one, two, or three cores: resources are going to increase at a much faster pace, and software designers and developers have been late to react to this massive growth in computing. Computing in a cloud requires a completely different approach to software design, and there are some great opportunities to innovate around it.