
                                  Wednesday, April 18, 2012

                                  4 Big Data Myths - Part II



This is the second and final part of this two-part series on Big Data myths. If you haven't read the first part, check it out here.

Myth # 2: Big Data is old wine in a new bottle

I hear people say, "Oh, that Big Data, we used to call it BI." One of the main challenges with legacy BI has been that you pretty much have to know what you're looking for, based on a limited set of data sources available to you. The so-called "intelligence" is people going around gathering, cleaning, staging, and analyzing data to create pre-canned "reports and dashboards" that answer a few very specific, narrow questions. By the time a question is answered, its value has been diluted. These restrictions stemmed from the fact that computational power was still scarce and the industry lacked sophisticated frameworks and algorithms to actually make sense of data. Traditional BI introduced redundancies at many levels, such as staging, cubes, etc. This in turn reduced the actual data size available to analyze. On top of that, there were no self-service tools to do anything meaningful with this data. IT has always been a gatekeeper, and it has always been resource-constrained. A lot of you can relate to this. If you asked IT to analyze traditional clickstream data, you became a laughing stock.

What is different about Big Data is not only that there's no real need to throw away any kind of data, but also that the "enterprise data", which always got VIP treatment in the old BI world while everyone else waited, has lost that elite status. In the world of Big Data, you don't know which data is valuable and which is not until you actually look at it and do something with it. Every few years the industry reaches some sort of an inflection point. In this case, the inflection point is the combination of cheap computing (cloud as well as on-premise appliances) and the emergence of several open, data-centric computing frameworks that can leverage that cheap computing.

Traditional BI is a symptom of hardware restrictions and legacy architectures unable to use newer data frameworks such as Hadoop and plenty of others in the current landscape. Unfortunately, retrofitting an existing technology stack may not be that easy if an organization truly wants to reap the benefits of Big Data. In many cases, buying some disruptive technology is nothing more than a line item on many CIOs' wish-lists. I would urge them to think differently. This is not BI 2.0. This is not BI at all as you have known it.


                                  Myth # 1: Data scientist is a glorified data analyst

The role of a data scientist has grown exponentially in popularity. Recently, DJ Patil, data scientist in residence at Greylock, was featured in Generation Flux by Fast Company. He is the kind of guy you want on your team. I know of quite a few companies that are unable to hire good data scientists despite their willingness to offer above-market compensation. This is also a controversial role, where people argue that a data scientist is just a glorified data analyst. This is not true. The data scientist is the human side of Big Data, and the role is real.

If you closely examine the skill set of people in the traditional BI ecosystem, you'll recognize that they fall into two main categories: database experts and reporting experts. People either specialize in complicated ETL processes, database schemas, vendor-specific data warehousing tools, SQL, etc., or they specialize in reporting tools, working with the "business," and delivering dashboards and reports. This is a broad generalization, but you get the point. There are two challenges with this set-up: a) people are hired based on vendor-specific skills such as databases and reporting tools, and b) they have a shallow mandate to get things done within restrictions that typically lead to silos and a lack of the bigger picture.

                                  The role of a data scientist is not to replace any existing BI people but to complement them. You could expect the data scientists to have the following skills:

• Deep understanding of data and data sources, to explore and discover patterns in how data is being generated.
• Theoretical as well as practical (tool-level) understanding of advanced statistical algorithms and machine learning.
• Strategic connection with the business at all levels, to understand broader as well as deeper business challenges and translate them into experiments with data.
• Ability to design and instrument the environment and applications to generate and gather new data, and to establish an enterprise-wide data strategy, since one of the promises of Big Data is to leave no data behind and to have no silos.

I have seen some enterprises that have a few people with some of these skills, but they are scattered around the company and typically lack high-level visibility and executive buy-in.

Whether data scientists should be domain experts or not is still being debated. I would strongly argue that the primary skill to look for when hiring a data scientist is whether they approach data with great curiosity and ask a lot of whys, not what kind of data they have dealt with. In my opinion, if you ask a domain expert to be a data expert, preconceived biases and assumptions (the curse of knowledge) will hinder discovery. Someone naive about yet curious in a specific domain actually works better, since they have no preconceived biases and are open to looking for insights in unusual places. Also, when they look at data in different domains, it helps them connect the dots and apply the insights gained in one domain to solve problems in a different domain.

No company would ever confess that its decisions are not based on hard facts derived from extensive data analysis and discovery. But, as I have often seen, most companies don't even know that many of their decisions could prove to be completely wrong had they had access to the right data and insights. It's scary, but that's the truth. You don't know what you don't know. BI never had one human face that we could all point to. Now, in the new world of Big Data, we can. And it's called a data scientist.

                                  Photo courtesy: Flickr

                                  Friday, March 30, 2012

                                  4 Big Data Myths - Part I



It was cloud then, and it's Big Data now. Every time there's a new disruptive category, it creates a lot of confusion. These categories are not well-defined. They just catch on. What hurts the most are the myths. This is the first part of my two-part series to debunk Big Data myths.

                                  Myth # 4: Big Data is about big data

It's a clear misnomer. "Big Data" is a name that sticks, but it's not just about big data. Defining a category based solely on the size of the data seems primitive and rather silly. And you could argue all day about what size of data qualifies as "big." But the name sticks, and that counts. Insights could come from a very small dataset or a very large one. Big Data is finally a promise not to discriminate against any data, small or large.

It has been prohibitively expensive and almost technologically impossible to analyze large volumes of data. Not anymore. Today, technology (commodity hardware, plus sophisticated software to leverage it) changes the way people think about small and large data. It's a data continuum. Big Data is not just about technology, either. Technology is just an enabler. It always has been. If you think Big Data is about adopting new shiny technology, that's very limiting. Big Data is an amalgamation of a few trends: data growth of an order of magnitude or two, external data becoming more valuable than internal data, and a shift in computing business models. Companies mainly looked at their operational data, invested in expensive BI solutions, and treated those systems as gold. Very few people in a company got value out of those systems, and even they got very little.

Big Data is about redefining what data actually means to you. Examine the sources you never cared to look at before, and instrument your systems to generate the kind of data that is valuable to you, not to your software vendor. This is not about technology. This is about a completely new way of doing business where data finally takes the driver's seat. The conversations about organizations' brands and their competitors' brands are happening in social media that they neither control nor have a good grasp of. At Uber, Bradley Voytek, a neuroscientist, is looking at interesting ways to analyze real-time data to improve the way Uber does business. Recently, Target came under fire for using data to predict the future needs of a shopper. Opportunities are in abundance.

                                  Myth # 3: Big Data is for expert users    

The last mile of Big Data is the tools. As technology evolves, the tools that allow people to interact with data have significantly improved as well. Without these tools the data is worth nothing. The tools have evolved in all categories, ranging from simple presentation charting frameworks to complex tools used for deep analysis. With the rising popularity and adoption of HTML5 and people's desire to consume data on tablets, investment in the presentation side of the tools has gone up. Popular JavaScript frameworks such as D3 have allowed people to do interesting things such as creating a personal annual report. The availability of various datasets published by public sector agencies in the US has also spurred creative analysis by data geeks, such as this interactive report that tracks money as people move to different parts of the country.

The other exciting trend has been self-service reporting in the cloud and better abstraction tools on top of complex frameworks such as Hadoop. Without self-service tools, most people will be cut off from the data chain even if they have access to the data they want to analyze. I cannot overemphasize how important the tools are in the Big Data value chain. They make it an inclusive system where more people can participate in data discovery, exploration, and analysis. Unusual insights rarely come from experts; they invariably come from people who were always fascinated by data but for whom analyzing data was never part of their day-to-day job. Big Data is about enabling these people to participate - all information accessible to all people.
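To make the abstraction point concrete, here is a minimal sketch of a word count using mrjob, one of the open-source Python layers over Hadoop streaming. The choice of mrjob is mine for illustration; the post does not name a specific tool.

```python
# A word count expressed through mrjob, a Python abstraction over
# Hadoop streaming. Illustrative sketch, not any vendor's product.
from mrjob.job import MRJob


class WordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the counts for each word across all mappers.
        yield word, sum(counts)


if __name__ == "__main__":
    WordCount.run()  # runs locally, or submits to a Hadoop cluster
```

The same few lines can run on a laptop for quick exploration or be submitted to a Hadoop cluster unchanged, which is exactly the kind of inclusiveness that self-service tooling brings.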

Coming soon in Part II: Myth # 2 and Myth # 1.

                                  Monday, November 7, 2011

                                  Early Signs Of Big Data Going Mainstream


Today, Cloudera announced a new $40m funding round to scale its sales and marketing efforts, and a partnership with NetApp under which NetApp will resell Cloudera's Hadoop distribution as part of its solution portfolio. Both of these announcements are significant signals of where the cloud and Big Data are headed.

Big Data going mainstream: Hadoop and MapReduce are not just for Google, Yahoo, and fancy Silicon Valley start-ups. People have recognized that there's a wider market for Hadoop in consumer as well as enterprise software applications. As I have argued before, Hadoop and the cloud are a match made in heaven. I blogged about Cloudera and the rising demand for data-centric massively parallel processing almost 2.5 years ago; obviously, we have come a long way. The latest Hadoop conference is completely sold out. It's good to see the early signs of Hadoop going mainstream. I am expecting similar success for companies such as DataStax (previously Riptano), which is a "Cloudera for Cassandra."

Storage is a mega-growth category: We are barely scratching the surface when it comes to growth in the storage category. Big Data combined with cloud growth is going to drive storage demand through the roof, and the established storage vendors are in the best position to take advantage of this opportunity. I wrote a cloud research report and predictions this year with the luminary analyst Ray Wang, in which I mentioned that cloud storage would sell like hot cakes and NoSQL would skyrocket. It's true this year, and it will be even more true next year.

Making PaaS even more exciting: PaaS is the future, and Hadoop and Cassandra are not easy to deploy and program. The availability of such frameworks at the lower layers makes PaaS even more exciting. I don't expect PaaS developers to solve these problems outright. I expect them to provide a layer that exposes the underlying functionality in a declarative as well as a programmatic way, letting application developers pick the PaaS platform of their choice and build killer applications.
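As a toy illustration of that declarative-versus-programmatic split, the sketch below expresses the same aggregation both ways. Python's built-in sqlite3 stands in for a PaaS-hosted data layer here, purely as an assumption for the example, not as any particular vendor's API.

```python
# The same aggregation expressed declaratively (SQL) and
# programmatically (explicit loop). sqlite3 is a stand-in for a
# PaaS data service in this sketch.
import sqlite3
from collections import defaultdict

events = [("signup", 1), ("login", 3), ("signup", 2), ("login", 1)]

# Declarative: state *what* you want; the engine decides how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (kind TEXT, n INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
declarative = dict(
    conn.execute("SELECT kind, SUM(n) FROM events GROUP BY kind")
)

# Programmatic: spell out *how* the aggregation happens, step by step.
programmatic = defaultdict(int)
for kind, n in events:
    programmatic[kind] += n

assert declarative == dict(programmatic)
print(declarative)  # {'login': 4, 'signup': 3}
```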

Push to the private cloud: Like it or not, the availability of Hadoop from an "enterprise" vendor is going to help the private cloud vendors. NetApp has a fairly large customer base, and its products are omnipresent in large private data centers. I know many companies that are interested in exploring Hadoop for a variety of needs but are somewhat hesitant to go out to a public cloud, since that requires them to move large volumes of on-premise data to the cloud. They're more likely to use a solution that comes to their data as opposed to moving their data to where a solution resides.

                                  Friday, June 1, 2007

                                  Moore's law for software

Software design has a strange relationship with computing resources. If resources are scarce, it is difficult to design, and if resources are abundant, it is a challenge to utilize them. It is rather odd to ask designers and developers to have more computing resources than they know how to use, but this is true and it is happening.

The immense computing resources have opened up a lot of opportunities for designers and developers to build agile and highly interactive web interfaces by tapping into this computing cloud. Effective resource utilization by software is lagging far behind the fast-growing computing resources. Google has successfully demonstrated a link between a humongous cloud infrastructure and applications that effectively use those resources. Gmail and Google Maps are examples of agile and highly interactive interfaces that consume heavy resources. Google's MapReduce is an example of effective utilization of computing resources by designing search to use heavy parallelization.

One of the challenges that designers face these days is to construct an application, from an interaction perspective, such that it can actually use the available resources effectively to provide a better user experience. Traditionally, performance tuning has been all about fixing software to perform faster without adding extra computing resources; designers and developers now face the opposite challenge of actually using the resources. Cloud computing is going to become more and more relevant as organizations catch up on Web 2.0 and Enterprise 2.0. Google, Yahoo, Salesforce, and Microsoft are betting on huge infrastructures that can deliver the juice required for their applications. Cloud computing is not just about hardware; it is about the scale of computing and the infrastructure required to reach that scale, such as physical location, energy and cooling requirements, dark fiber, etc.
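To see why the pattern parallelizes so well, here is a minimal sketch of MapReduce-style word counting in Python. The standard multiprocessing module stands in for a cluster of machines, and the input chunks are invented for the example.

```python
# A minimal sketch of the MapReduce pattern: independent parallel
# mappers, then a single reduce step that merges their outputs.
from collections import defaultdict
from multiprocessing import Pool

def map_phase(chunk):
    # Each mapper counts words in its own chunk, independently.
    counts = defaultdict(int)
    for word in chunk.split():
        counts[word.lower()] += 1
    return counts

def reduce_phase(partials):
    # The reducer merges the partial counts from all mappers.
    total = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            total[word] += n
    return total

if __name__ == "__main__":
    chunks = ["the quick brown fox", "jumps over the lazy dog", "the fox"]
    with Pool() as pool:
        partials = pool.map(map_phase, chunks)  # mappers run in parallel
    print(dict(reduce_phase(partials)))
```

Because each mapper touches only its own chunk, adding machines (or cores) scales the map phase almost linearly, which is the heart of the technique.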

Not every single piece of code in software can be parallelized. Developers hit serial tasks in the code flow under many dynamic conditions. Semantic search is a classic example that struggles to use parallel computing resources, since certain tasks end up serialized due to the dynamic nature of many semantic search engines and their natural language processing abilities. Cognitive algorithms are not the same as statistical or relevancy algorithms, and they require a radically different design approach to effectively utilize the available resources.
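The usual way to quantify this limit is Amdahl's law: if a fraction s of a program is inherently serial, the speedup on p processors is at most 1 / (s + (1 - s) / p), which approaches 1/s no matter how many processors you add. A quick sketch, with made-up serial fractions chosen only for illustration:

```python
# Amdahl's law: speedup on p processors when a fraction s of the work
# is inherently serial. The fractions below are illustrative, not measured.
def amdahl_speedup(s: float, p: int) -> float:
    return 1.0 / (s + (1.0 - s) / p)

for s in (0.05, 0.25, 0.50):   # 5%, 25%, 50% serial work
    for p in (2, 8, 1024):
        print(f"serial={s:.0%} cores={p:>4} speedup={amdahl_speedup(s, p):5.2f}x")
```

Even with 1,024 cores, a workload that is half serial tops out just under a 2x speedup, which is why the serialized steps in a semantic search pipeline dominate its scalability.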

Intel has been pushing the industry to improve performance on its multi-core CPUs. Microsoft recently announced an initiative to redesign the next Windows for multiple cores. The design is not just about one, two, or three cores. The resources are going to increase at a much faster pace, and software designers and developers have been slow to react to this massive growth in computing. Computing in a cloud requires a completely different approach to software design, and there are some great opportunities to innovate around it.

                                                      
                                                      
