Steve Watt Presentation


Steve Watt provides an introduction to big data and how it can be used to provide answers to real-life questions.

Published in: Technology, Education

Speaker notes:
  • As hardware becomes increasingly commoditized, margin and differentiation move to software; as software becomes increasingly commoditized, margin and differentiation move to data. Around 2000, cloud emerges as an IT sourcing alternative (virtualization extends into the cloud), alongside an explosion of unstructured data and mobile. "Let's create a context in which to think...." Focus on three major tipping points in the evolution of the technology; mention that this is a very web-centric view, in contrast to Barry Devlin's enterprise view. Assumes networking falls under hardware and cloud sits at the intersection of software and data. Why should you care? Tipping Point 1: Situational Applications. Tipping Point 2: Big Data. Tipping Point 3: Reasoning.
  • Web 2.0: information explosion; now many channels, turning consumers into producers (Shirky). Tipping point: web standards allow rapid application development; the advent of situational applications, folksonomies, social. SOA: functionality exposed through open interfaces and open standards; great strides in modularity and re-use while reducing the complexity around system integration; but you still need to be a developer to create applications using these service interfaces (WSDL and SOAP are way too complex!). Enter mashups: place a facade on the service and you have the final step in the evolution of services and service-based applications. Now anyone can build applications (i.e. non-programmers); we've taken the entire SOA library and exposed it to non-programmers. What do I mean? Check out this YouTunes app. This was the first example where we saw arbitrary data/content re-purposed in ways the original authors never intended, e.g. Craigslist/Gumtree homes for sale scraped, placed on a Google map, and mashed up with crime statistics. The whole is greater than the sum of its parts: new kinds of information! BUT there are limitations on how much arbitrary data can be scraped and turned into information; there is usually no pre-processing, just what can be rendered on a single page. Demo.
  • "Every 2 days we create as much data as we did from the dawn of humanity until 2003." We've hit the petabyte and exabyte age. What does that mean? Let's look (next slide).
  • Mention enterprise growth over time, mobile/sensor data, Web 2.0 data exhaust, and social networks. Advances in analytics: keep your data around for deeper business insights and to avoid enterprise amnesia.
  • How about we summarize a few of the key trends in the Web as we know it today? This diagram shows some of the main trends of what Web 3.0 is about. Netflix accounts for 29.7% of US traffic; mention the Web 2.0 Summit "Points of Control". Having more data leads to better context, which leads to deeper understanding, insight, or new discoveries. Refer to Reid Hoffman's views on what Web 3.0 is.
  • These are pre-processed, though, and not flexible: you can't ask specific questions that haven't been pre-processed.
  • Mention folksonomies in Web 2.0 with searching Delicious Bookmarks. Mention Chilean Earthquake Crisis Video using Twitter to do Crisis Mapping.
  • Talk about Visualizations and InfoGraphics – manual and a lot of work
  • They are only part of the solution & don’t allow you to ask your own questions
  • This is the real promise of Big Data
  • These are not all the problems around Big Data; they are the bigger problems around deriving new information from web data. There are other issues as well, such as inconsistency, skew, etc.
  • Give a Nutch example
  • Specifically call out the color coding reasoning for Map/Reduce and HDFS as a single distributed service
  • Give examples of how one might use Open Calais or Entity Extraction libraries

    1. 1. Big Data. Steve Watt, Emerging Technologies @ HP. (Someday Soon, Flickr)
    2. 2. 2– timsnell (Flickr)
    3. 3. Agenda: Hardware, Software, Data • Big Data • Situational Applications
    4. 4. Situational Applications 4– eaghra (Flickr)
    5. 5. Web 2.0 Era Topic Map: Data Explosion, Inexpensive Storage, LAMP, Social Platforms, Publishing Platforms, Produce, Process, Situational Applications, Web 2.0, Mashups, SOA, Enterprise
    6. 6. 6
    7. 7. Big Data 7– blmiers2 (Flickr)
    8. 8. The data just keeps growing... 1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte; 1024 petabytes = 1 exabyte. 1 petabyte = 13.3 years of HD video. 20 petabytes = amount of data processed by Google daily. 5 exabytes = all words ever spoken by humanity.
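These binary units are easy to sanity-check directly. A small sketch: the ~9 GB per hour HD bitrate below is an illustrative assumption chosen to reproduce the slide's 13.3-year figure, not a number from the slides.

```python
# Binary (base-1024) storage units as defined on the slide.
GB = 1024 ** 3           # bytes in a gigabyte (binary)
TB = 1024 * GB           # 1024 GB = 1 TB
PB = 1024 * TB           # 1024 TB = 1 PB
EB = 1024 * PB           # 1024 PB = 1 EB

# Sanity-check "1 petabyte = 13.3 years of HD video", assuming
# ~9 GB per hour of HD video (roughly 20 Mbit/s -- an assumption,
# not a figure from the slides).
hours_per_petabyte = PB / (9 * GB)
years_per_petabyte = hours_per_petabyte / (24 * 365.25)
print(round(years_per_petabyte, 1))  # ~13.3
```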
    9. 9. [Diagram: the evolving web stack] Web 1.0: connecting machines (infrastructure). Web 2.0: connecting people (API foundation, the web as a platform), with a data exhaust of historical and real-time data. Service economy: a service for this, a service for that (Google, Netflix, New York Times, eBay, Pandora, PayPal). The fractured web opportunity: Facebook, Twitter, LinkedIn. Mobile app economy for devices: an app for this, an app for that (set-top boxes, tablets, multiple sensors in your pocket). Sensor web: an instrumented and monitored world, real-time data.
    10. 10. Data Deluge! But filter patterns can help... (Kakadu, Flickr)
    11. 11. Filtering with Search
    12. 12. Filtering Socially
    13. 13. Filtering Visually
    14. 14. But filter patterns force you down a pre-processed path (M.V. Jantzen, Flickr)
    15. 15. What if you could ask your own questions? (wowwzers, Flickr)
    16. 16. And go from discovering Something about Everything... (MrB-MMX, Flickr)
    17. 17. To discovering Everything about Something?
    18. 18. How do we do this? Let's examine a few techniques for Gathering, Storing, Processing & Delivering Data @ Scale.
    19. 19. Gathering Data: Data Marketplaces
    20. 20. 20
    21. 21. 21
    22. 22. Gathering Data: Apache Nutch (Web Crawler)
    23. 23. Storing, Reading and Processing: Apache Hadoop. A cluster technology with a single master that scales out with multiple slaves. It consists of two runtimes: the Hadoop Distributed File System (HDFS) and Map/Reduce. As data is copied onto the HDFS, it is blocked and replicated to other machines to provide redundancy. A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop master, which in turn distributes the job to each slave in the cluster. Jobs run on data that is on the local disks of the machines they are sent to, ensuring data locality. Node (slave) failures are handled automatically by Hadoop, which may execute or re-execute a job on any node in the cluster. Want to know more? "Hadoop: The Definitive Guide" (2nd Edition)
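The Map/Reduce model described above can be simulated in a few lines of plain Python: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group. This is a local sketch of the programming model only; real Hadoop distributes these phases across the cluster.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle/sort: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key (here, a sum)."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big clusters process big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

The same three-phase shape holds for real jobs; only the map and reduce bodies change.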
    24. 24. Delivering Data @ Scale • Structured data • Low latency & random access • Column stores (Apache HBase or Apache Cassandra): faster seeks, better compression, simpler scale-out • De-normalized: data is written as it is intended to be queried. Want to know more? "HBase: The Definitive Guide" & "Cassandra High Performance"
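The de-normalization bullet is the key design shift: rows are written pre-joined under a key shaped like the query. A toy sketch follows, with a plain dict standing in for a column store; the key layout and column names are made up for illustration, and this is not the HBase or Cassandra API.

```python
# A dict standing in for a column store: row key -> {column: value}.
store = {}

def put_investment(city, year, company, amount):
    # The row key is built from the query dimensions (city, then year),
    # so "all 2011 investments in Austin" is a single key-prefix scan,
    # not a join across normalized tables.
    row_key = f"{city}#{year}#{company}"
    store[row_key] = {"company": company, "amount": amount}

put_investment("austin", 2011, "acme", 5_000_000)
put_investment("austin", 2011, "initech", 2_000_000)
put_investment("boston", 2011, "hooli", 9_000_000)

# Query: total 2011 investment in Austin via a prefix scan over row keys.
total = sum(row["amount"] for key, row in store.items()
            if key.startswith("austin#2011#"))
print(total)  # 7000000
```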
    25. 25. Storing, Processing & Delivering: Hadoop + NoSQL. [Diagram] Gather: web data arrives via a Nutch crawl, log files via a Flume connector, and relational data (MySQL, over JDBC) via the Sqoop connector, all copied onto HDFS. Read/Transform: Apache Hadoop cleans and filters the data, then transforms and enriches it, often across multiple Hadoop jobs. Serve: results are pushed through a NoSQL connector/API into a NoSQL repository, which the application queries at low latency.
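The gather/transform/serve pipeline in that diagram can be sketched end to end with pure-Python stand-ins for the crawl output, the Hadoop jobs, and the serving store; record formats and names here are illustrative.

```python
# Stage 1 -- Gather: raw records as they might arrive from a crawl or log files.
raw_records = [
    "acme,austin,5000000",
    "",                      # empty line: to be filtered out
    "initech,austin,2000000",
    "badline",               # malformed record: to be filtered out
]

# Stage 2 -- Read/Transform: clean/filter, then transform/enrich
# (each function standing in for one Hadoop job).
def clean_and_filter(records):
    return [r for r in records if r.count(",") == 2]

def transform_and_enrich(records):
    out = []
    for r in records:
        company, city, amount = r.split(",")
        out.append({"company": company, "city": city, "amount": int(amount)})
    return out

# Stage 3 -- Serve: load into a key/value "NoSQL repository" for low-latency reads.
repository = {row["company"]: row
              for row in transform_and_enrich(clean_and_filter(raw_records))}
print(repository["acme"]["amount"])  # 5000000
```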
    26. 26. Some things to keep in mind… 26– Kanaka Menehune (Flickr)
    27. 27. Some things to keep in mind... • Processing arbitrary types of data (unstructured, semi-structured, structured) requires normalizing data with many different kinds of readers. Hadoop is really great at this! • However, readers won't really help you process truly unstructured data such as prose. For that you're going to have to get handy with Natural Language Processing, which is really hard. Consider using parsing services & APIs like Open Calais. Want to know more? "Programming Pig" (O'Reilly)
    28. 28. Open Calais (Gnosis)
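As a stand-in for what an entity-extraction service returns, here is a deliberately crude heuristic extractor. It is just a capitalized-phrase scan and bears no resemblance to the real Open Calais API, which was a hosted web service that also classified and resolved entities.

```python
import re

def extract_entities(text):
    """Naive entity extraction: runs of capitalized words.

    A real service like Open Calais also classifies entities
    (person, company, place) and disambiguates them; this does not.
    """
    return re.findall(r"(?:[A-Z][a-z]+\s)+[A-Z][a-z]+|[A-Z][a-z]+", text)

print(extract_entities("Steve Watt works on Hadoop at HP in Austin."))
# ['Steve Watt', 'Hadoop', 'Austin']  -- note it misses all-caps "HP"
```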
    29. 29. Statistical real-time decision making: capture historical information; use machine learning to build decision-making models (such as classification, clustering & recommendation); mesh real-time events (such as sensor data) against the models to make automated decisions. Want to know more? "Mahout in Action"
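A minimal sketch of the recommendation idea: build a model from historical data by counting item co-occurrences, then score new queries against it. This is the flavor of model Mahout builds at scale, not Mahout's actual API; the item names are invented.

```python
from collections import Counter
from itertools import combinations

# Historical information: which items each user interacted with.
histories = [
    {"hadoop", "hbase", "pig"},
    {"hadoop", "hbase"},
    {"hadoop", "pig"},
]

# Model building: count how often each pair of items co-occurs across users.
cooccur = Counter()
for items in histories:
    for a, b in combinations(sorted(items), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def recommend(item, n=2):
    """Recommend the items most often seen alongside `item`."""
    scores = Counter({b: c for (a, b), c in cooccur.items() if a == item})
    return [it for it, _ in scores.most_common(n)]

print(recommend("hadoop"))  # both "hbase" and "pig" co-occur twice
```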
    30. 30. (Pascal Terjan, Flickr)
    31. 31. 31
    32. 32. 32
    33. 33. Using Apache Nutch: identify optimal seed URLs for a seed list & crawl to a depth of 2. For example: the data is stored in sequence files in the segments directory on the HDFS.
    34. 34. 34
    35. 35. Making the data STRUCTURED: retrieving the HTML, preliminary filtering on URL, mapping into a Company POJO, then writing tab-delimited ("\t") output.
    36. 36. Aargh! My viz tool requires zip codes to plot geospatially!
    37. 37. Apache Pig script to join on City to get ZipCode and write the results to Vertica:
ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS (State:chararray, City:chararray, ZipCode:int);
CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);
STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}' USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
    38. 38. Total Tech Investments By Year
    39. 39. Investment Funding By Sector
    40. 40. Total Investments By Zip Code for all Sectors: $1.2 Billion in Boston, $7.3 Billion in San Francisco, $2.9 Billion in Mountain View, $1.7 Billion in Austin
    41. 41. Total Investments By Zip Code for Consumer Web: $600 Million in Seattle, $1.2 Billion in Chicago, $1.7 Billion in San Francisco
    42. 42. Total Investments By Zip Code for BioTech: $1.3 Billion in Cambridge, $528 Million in Dallas, $1.1 Billion in San Diego
    43. 43. Questions? Steve Watt @wattsteve stevewatt.blogspot.com