Hadoop acm presentation


Microsoft Hadoop presentation for ACM Data Mining Hackathon competition.

  • Good afternoon. Thanks for coming; I know you're going to be really excited about this. I'm going to talk about Big Data, Hadoop and Microsoft. It's simply amazing to see the growing momentum around the Big Data conversations happening today. Hadoop is changing the conversations that we have about data, and about Big Data in particular. I want to make sure we stay grounded in thinking through how to make money and save money with your data using Hadoop. <next slide>
  • Let’s talk about size for a moment. The example I like to use is the US Library of Congress. It has millions of books, recordings, photographs, maps, music and manuscripts. All put together, they hold around 300TB of information. How much is that? That's 838 miles of bookshelves. If you were to stretch those out end to end, then go downstairs, get in your car and start driving at 65mph, you'd hit the end of the books 13 hours later in New York City. A little over three times that is a petabyte. Microsoft is managing well over 100 petabytes of data across our online properties. That single row of bookshelves from New York to Jacksonville, Florida is now half a mile high. That’s stunning. We are adding 7.5PB of new data per month and running 20k analytic jobs per day to run our online services business. The good news is that hardware is now fast and cheap enough that we can record this data and consume it. This simply wasn’t possible a few years ago. Hard drive density and CPU power continue to double every 18 months. From the Microsoft point of view, we have a pretty good understanding of how to build and operate one of these infrastructures and, in the end, connect it through to developers and end users. We’re the only ones in the 100+ Petabyte Club who also run an enterprise software and cloud business. I see the complete solution as one where we enable developers to build applications on this data and connect them through to our end users with BI tools to deliver Breakthrough Big Data Insights. I will talk more about that in the Hadoop and Excel talk and take this down to a practical level in my second talk. This leads me on to the next concept: it’s all about the data.
  • It’s all about your data. Actually, it’s all about your big data. Is it big as in Volume, where your data exceeds the physical limits of today's systems? Is it Velocity, where the data is moving at a fast rate and its value can decay over time? Is it Variability of structure, from unstructured and semi-structured to highly structured data? (Doug Laney: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf) The answer is that it’s all of the above. Now that you have Big Data, you have two problems: you have BIG DATA problems, and you have big DATA PROBLEMS.
  • The second thing I want to talk about is Hadoop and how Hadoop is set up to deliver Breakthrough Business Insights from your data. How many of you are familiar with Hadoop? How many of you are using Hadoop for projects today? How many are planning on using Hadoop in the next 12 months? How about in the cloud? When people talk about Hadoop, they are often talking about specific computational patterns, including MapReduce, which emerged as a method to process lots of unstructured data on top of a distributed storage system in a highly fault-tolerant and embarrassingly scalable way. Hadoop allows us to store and process large amounts of data on commodity hardware. In the past you would spend large amounts of money on very specialized hardware. Today you can do this with off-the-shelf hardware running Hadoop. Now, Hadoop doesn’t have a monopoly on “big”, “real time” or “unstructured”, but it does provide some unique capabilities.
  • Data is everywhere to be mined, but we have what one can call "the pomegranate problem". Imagine all of your data being inside a pomegranate. When you eat a pomegranate, getting all of the little pieces out takes a bit of work. That’s the process you need to go through to extract business insights from your data. It’s useful to think of it this way: your data is the platform, not the tooling that surrounds it. It’s all about the data. I’d like to share with you my favorite big data quotation from a famous Big Data philosopher. <next slide>
  • We don't have a Hadoop problem; we have analytics, pattern mining, trend analysis, statistical inference, economic modeling and market regression problems. Big Data, in terms of data size, variability and velocity at scale, is the first problem. But Big Data solutions and technology by themselves don't lead to solving business objectives. Data science starts where utility-class services like Big Data Hadoop end. The real opportunity is for data science as a hosted petascale service on top of cloud infrastructure. As powerful as Hadoop is, today it’s still more of a computer scientist’s or academically trained analyst’s tool than it is an enterprise analytics product. Hadoop itself is controlled through programming code rather than anything that looks like it was designed for business unit personnel. Hadoop data is often more “raw” and “wild” than the data typically fed to data warehouse and OLAP (Online Analytical Processing) systems. This is where Microsoft and I see opportunity. Essentially, wouldn't it be cool if mere mortals could use this stuff and consume insights that come directly from Hadoop?
  • I see the real breakthrough insights coming when you take traditional "Business Intelligence" and add more capabilities like machine learning, predictive analysis, statistical analysis, large-scale graph processing, pattern mining, trend analysis and economic modeling. All of these are a reality in Hadoop today. The implications of this are quite astounding when you think about it. This is huge.
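
The size comparison in the notes for slide 2 can be checked with a little arithmetic. The following is a minimal sketch, assuming binary units (1 PB = 1024 TB), which the talk does not specify. The figures are the ones quoted in the notes; the constant and function names are illustrative only.

```python
# Figures quoted in the speaker notes (not independently verified).
SHELF_MILES = 838          # miles of bookshelves
SPEED_MPH = 65             # driving speed
LIBRARY_TB = 300           # approximate size of the collection
MONTHLY_GROWTH_PB = 7.5    # new data added per month
JOBS_PER_DAY = 20_000      # analytic jobs per day

# Assumption: binary units. Use 1000 for decimal units instead.
TB_PER_PB = 1024


def drive_hours(miles, mph):
    """Hours needed to drive past the whole row of bookshelves."""
    return miles / mph


def libraries_in(petabytes, library_tb=LIBRARY_TB, tb_per_pb=TB_PER_PB):
    """How many 300TB collections fit into the given number of petabytes."""
    return petabytes * tb_per_pb / library_tb


def yearly_growth_pb(monthly_pb=MONTHLY_GROWTH_PB):
    """Data added over twelve months at the quoted monthly rate."""
    return monthly_pb * 12


def seconds_between_jobs(jobs_per_day=JOBS_PER_DAY):
    """Average gap between job starts if jobs were spread evenly over a day."""
    return 24 * 60 * 60 / jobs_per_day


if __name__ == "__main__":
    print(f"Drive time: {drive_hours(SHELF_MILES, SPEED_MPH):.1f} hours")
    print(f"Collections per petabyte: {libraries_in(1):.2f}")
    print(f"Collections in 100 PB: {libraries_in(100):.0f}")
    print(f"Growth per year: {yearly_growth_pb():.0f} PB")
    print(f"One job every {seconds_between_jobs():.2f} seconds")
```

Under these assumptions the drive takes just under 13 hours and one petabyte holds a little over three such collections, which matches the figures in the notes.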

    1. Hadoop and Microsoft. Brad Sarsfield | Senior Software Engineer @bradoop
    2. How Big is Big Data?
    3. It’s all about your Big Data Problems
    4. Hadoop is for Big Data.
    5. Data is the Platform.
    6. Hadoop Data Science.
    7. Hadoop Capabilities. Extract Load Transform • Distributed Compute • Predictive Analysis • Machine Learning • Graph Processing
    8. Hadoop architecture. Distributed Processing (Map Reduce) • Distributed Storage (HDFS)
    9. Hadoop and Microsoft.
       Big engineering investment
         • Big Data Business Intelligence tooling
         • Big Data Apache Hadoop
         • Big Data Parallel Data Warehouse
       Open source commitment
         • Apache Software Foundation
         • Hortonworks Partnership
       We are delivering
         • Apache Hadoop on Windows Server
         • Apache Hadoop on Windows Azure
    10. Microsoft Hadoop Vision.
       Better on Windows and Azure
         • Active Directory
         • System Center
       Microsoft Data Connectivity
         • SQL Server / SQL Parallel Data Warehouse
         • Azure Storage / Azure Data Market
       Microsoft Business Intelligence (BI)
         • ODBC Connectivity
    11. ACM Hackathon.
       Free Hadoop on Azure
         • Code: acmhackathon
       Free 30 day Azure account
         • No credit card
         • 750h small compute / 35GB storage
         • Email brad@bing.com for code
       Hadoop on Azure demo
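
Slide 8 names MapReduce as the processing layer, and the notes describe it as a pattern for processing lots of unstructured data in a scalable way. The following is a minimal sketch of that pattern as a word count, run in a single Python process. It does not use Hadoop or any Hadoop API; the function names are hypothetical, and the in-memory `shuffle` step only stands in for the grouping that a real framework performs across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big insights", "hadoop stores big data"]
    print(word_count(sample))
```

Each mapper call looks at only one line and each reducer call at only one key, so the work can be split across many machines without coordination inside a step. That independence is what lets the pattern scale out on commodity hardware.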