Gilbane Boston 2011 big data
"Get Ready for Big Data" presentation from Gilbane Boston 2011; for more details, see http://gilbaneboston.com/conference_program.html#t2 and http://pbokelly.blogspot.com/2011/12/gilbane-boston-2011-big-data.html

Published in: Technology, Education
  • In fairness, the Wikipedia article, as of 20111117, also noted “This article appears to be in both diffused categories and their subcategories, or has an overbroad categorization, and may need cleanup.”
  • Image source: http://www.nasa.gov/audience/forstudents/5-8/features/what-is-a-black-hole-58.html
  • This is a high-level dichotomy – and not meant to be precise or mutually-exclusive (i.e., some info items have both resource and relation attributes)
  • This is meant to be illustrative – neither precise nor exhaustive
  • Point of having a merged cell for physical: it’s all coming together – it’s increasingly difficult to distinguish the underlying physical model services… Hypertext is not 1:1 with HTML – it’s beyond-the-basics hypertext as manifested, e.g., in Web publishing and collaboration-oriented systems/servers. XQuery is not mainstream today, but it is exceptionally powerful and was co-developed in conjunction with XPath 2.0.
  • Captured 20111117. Wikipedia also notes “This article provides insufficient context for those unfamiliar with the subject. Please help improve the article with a good introductory style. (October 2011)”
  • NoSQL is sometimes also associated with open source DBMS, adding more confusion
  • Image source: http://hadoop.apache.org/
  • Image sources:
      http://hadoop.apache.org/
      http://www.slideshare.net/cloudera/tokyo-nosqlslidesonly?from=ss_embed
    Related vendor press releases:
      http://www.asterdata.com/news/091001-Aster-Hadoop-connector.php
      http://www.emc.com/about/news/press/2011/20110509-03.htm
      http://www.vertica.com/2010/10/10/vertica-4-0-connector-for-hadoop/
      http://thinking.netezza.com/blog/hadoop-netezza-synergy-data-analytics-results-new-customer-deployment-trends-part-1
      http://developer.teradata.com/tag/hadoop
      http://www.informatica.com/news_events/press_releases/Pages/11012010_cloudera.aspx
      http://www.zdnet.com/blog/microsoft/microsoft-drops-dryad-puts-its-big-data-bets-on-hadoop/11226
  • Bottom row is not meant to imply RDBMS doesn’t offer indexing – rather that current leading RDBMSs don’t offer 100% XDBMS features
  • Source: http://www.forrester.com/rb/Research/stay_alert_to_database_technology_innovation/q/id/57947/t/2
  • Continuing from research program @ UC Berkeley on size of the web
  • More than doubling every 2 years – 51% CAGR. Zettabyte = 1000 Exabytes; Exabyte = 1000 Petabytes; Petabyte = 1000 Terabytes; Terabyte = 1000 Gigabytes (so Zettabyte = 1 000 000 000 000 gigabytes)
  • This is growth for enterprise info under management only… Not consumer data, RFID, even web log data etc., unless explicitly managed. Industry consensus estimate: 80% unstructured info vs. 20% structured
  • More info is digital to begin with. More file types – e.g. video, collaborative, mobile, presence, etc. More digital activity as part of the job/task environment
  • In raw state, the digital world looks about as organized as this view of a piece of the universe. The first question has to be: what's out there? So search is really the first pattern in the technologies of unstructured information.
  • Note more than a doubling in the past year…
  • This is a historical view of the largest web search indexes of their day.
  • Social technology has brought a new pattern – the social graph – to the technologies of unstructured information. Now it's not just documents and links, but people and their relationships and their friends' relationships. Facebook today
  • Go to this link to see your own map of LinkedIn connections.
  • If you were Google, would you like to own that data? If you were Facebook, would you consider marketing a smartphone product? If you were Microsoft, you might buy a mobile phone company to get into the market, if you had failed to build your own ecosystem.
  • Attensity screen shots – text analytics for web-wide monitoring of blogs, social networks, YouTube videos, traditional media, with purpose-built analytics and dashboards for timely/high availability decisions.
  • Emphasis on guidance rather than discovery. Personal assistance – e.g. Siri & her successors; Healthcare; Customer service; Customer self-service for complex products – e.g. investments, insurance, etc.; Maintenance; IT service environments; Many more…
  • …in the period between 2010 and 2015.
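The unit arithmetic and growth rate quoted in the notes above (decimal storage units, 51% CAGR) can be checked with a short sketch. The function names here are hypothetical and introduced only for illustration; the only inputs taken from the notes are the factor of 1000 between adjacent units and the 51% rate.

```python
import math

# Decimal (SI) storage units in ascending order; each step is a factor of 1000,
# matching the unit definitions quoted in the notes.
UNITS = ["gigabyte", "terabyte", "petabyte", "exabyte", "zettabyte"]


def gigabytes_per(unit: str) -> int:
    """Return how many gigabytes make up one `unit` (decimal prefixes)."""
    return 1000 ** UNITS.index(unit)


def growth_factor(cagr: float, years: float) -> float:
    """Total growth multiple after `years` at compound annual rate `cagr` (0.51 = 51%)."""
    return (1 + cagr) ** years


def doubling_time_years(cagr: float) -> float:
    """Years needed for a quantity to double at compound annual rate `cagr`."""
    return math.log(2) / math.log(1 + cagr)


if __name__ == "__main__":
    print(gigabytes_per("zettabyte"))          # 1000**4 gigabytes in one zettabyte
    print(round(growth_factor(0.51, 2), 2))    # growth multiple over two years
    print(round(doubling_time_years(0.51), 2)) # years to double at 51% CAGR
```

At 51% CAGR the two-year multiple is about 2.28, which is consistent with "more than doubling every 2 years", and multiplying out the unit chain gives 10^12 gigabytes per zettabyte.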
  • Transcript

    • 1. Get Ready for Big Data
        Wednesday, November 30, 2011, 2:40 – 4:00
        Peter O’Kelly, Principal Analyst, O’Kelly Associates
        Hadley Reynolds, Managing Director, Next Era Research
        Kathleen Reidy, Senior Analyst, 451 Research
    • 2. Agenda
        • Big data in context
        • Big structured data
        • Big unstructured data
        • Big opportunities and risks
        • Q&A
    • 3. Big Data in Context
        • What is “big data”?
            – Unhelpfully, both “big data” and “NoSQL,” generally considered a key part of the big data wave, are defined more in terms of what they’re not than what they are
            – A typical big data definition (Wikipedia):
                • “[…] datasets that grow so large that they become awkward to work with using on-hand database management tools”
    • 4. Big Data in Context
        • With thanks to the Business SOA blog:
            – “[…] describe Big Data in the same way that the Hitchhiker’s Guide to the Galaxy described space:
            – ‘Space,’ it says, ‘is big. Really big. You just won’t believe how vastly, hugely, mindbogglingly big it is. I mean, you may think it’s a long way down the road to the chemist’s, but that’s just peanuts to space, listen...’”
    • 5. Big Data in Context
        • Why is big data a big deal now?
            – Commodity hardware and the Internet
                • Capability and price/performance curves that continue to defy all economic “laws”
                • Also facilitating compelling cloud services
            – Maturation and uptake of open source software, e.g., Hadoop
                • Powerful and often no- or low-cost
            – IT market
                • Enthusiasm for “NoSQL” systems
                • Frustration with incumbent information management vendors
            – Useful new data sources/resources, e.g., social network activity graphs, the “Internet of things,” sensor networks…
            – Competitive and compliance imperatives
    • 6. Big Data in Context
        • A big data reality check
            – “Mindbogglingly”-scale information management is not new
                • Consider, e.g., VLDB, multi-billion document repositories, and the World Wide Web…
            – What is new and compelling
                • The combination of market dynamics producing new capability and price/performance curves
                • Cloud
                    – No deep capital investment required to get started
                    – Cloud-based information resources
                • Some innovative marketing, suggesting
                    – Self-proclaimed next-generation big data systems are magical and revolutionary
                    – Deployed systems are obsolete and wasteful
    • 7. A Big-Picture Framework
        • A digital information item dichotomy
            – Resources (~unstructured information)
                • Digital artifacts optimized to convey stories
                    – Organized in terms of narrative, hierarchy, and sequence
                • Examples: books, magazines, documents (e.g., PDF, Word), Web pages, XBRL documents, video, hypertext…
            – Relations (~structured information)
                • Application-independent descriptions of real-world things and relationships
                • Examples: business domain databases, e.g., customer, sales, HR…
    • 8. A Big-Picture Framework
        Resource; Relation
    • 9. A Big-Picture Framework
        – Conceptual
            • Resources: resources and links
            • Relations: entities, attributes, relationships, and identifiers
        – Logical
            • Resources: model: hypertext; language: XQuery (ideally)
            • Relations: model: extended relational; language: SQL
        – Physical (one cell spanning both Resources and Relations)
            • Indexing (e.g., scalar data types, XML, full-text), locking and isolation levels, federation, replication, in-memory databases, columnar storage, table spaces, caching, and more
    • 10. Agenda
        • Big data in context
        • Big structured data
        • Big unstructured data
        • Big opportunities and risks
        • Q&A
    • 11. Big Structured Data
        • NoSQL
        • Hadoop
        • RDBMS reconsidered
        • Back to the bigger picture
    • 12. NoSQL
        • No clear consensus on what “NoSQL” means
            – Started with what it’s against, not what it’s about
                • And often finds a receptive audience due to frustration with RDBMS business-as-usual
            – The “NoSQL” meme is a moving target
                • Initially implied “Just say ‘no’ to SQL”
                • Later quietly redefined as “Not Only SQL”
                • What may be next: “New Opportunities for SQL”
                    – I.e., some developers may reconsider the value of SQL and RDBMSs, after hitting NoSQL limitations
    • 13. A NoSQL Taxonomy
        • From the NoSQL Wikipedia article:
    • 14. NoSQL Perspectives
        • The “NoSQL” meme confusingly conflates
            – Document database requirements
                • Best served by XML DBMS (XDBMS)
            – Physical model decisions on which only DBAs and systems architects should focus
                • And which are more complementary than competitive with RDBMS/XDBMS
            – Object databases, which have floundered for decades
                • But with which some application developers are nonetheless enamored, for minimized “impedance mismatch,” despite significant information management compromises
            – Semantic models
                • Also more complementary than competitive with RDBMS/XDBMS
    • 15. Hadoop
        • Hadoop is often considered central to big data
            – Originating with Google’s MapReduce architecture, Apache Hadoop is an open source architecture for distributed processing on networks of commodity hardware
        • Commercial application domains include (from Wikipedia)
            – Log and/or clickstream analysis of various kinds
            – Marketing analytics
            – Machine learning and/or sophisticated data mining
            – Image processing
            – Processing of XML messages
            – Web crawling and/or text processing
            – General archiving, including of relational/tabular data, e.g. for compliance
    • 16. Hadoop
        • Hadoop is popular and rapidly evolving
            – Most leading information management vendors, including Microsoft, have embraced Hadoop
            – There is now a Hadoop ecosystem
    • 17. RDBMS Reconsidered
        • RDBMS incumbents appear to be under siege, with
            – IT frustration with RDBMS business-as-usual
                • Counterproductive RDBMS vendor policies and attitudes
                • DBA modus operandi often seen as excessively conservative
            – Conventional wisdom about RDBMS limitations for, e.g.,
                • “Web scale”
                • “Agility”
                • The application/database “impedance mismatch”
            – The advent of open source and/or specialized DBMSs
                • E.g., MySQL is the M in the “LAMP stack”
                • “The end of the one-size-fits-all DBMS era”
    • 18. RDBMS Reconsidered
        • An RDBMS reality check
            – Leading RDBMS products and open source initiatives are very powerful and flexible
                • And will continue to evolve, e.g., with the mainstream deployment of massive-memory servers and solid state disk (SSD) storage
            – And they continue to expand
                • E.g., in-database processing, with, for example, analytics engines running within DBMS kernels
            – But the RDBMS incumbents nonetheless face unprecedented challenges
                • Which sometimes resonate with frustrated architects and developers because of negative experiences that have more to do with how RDBMSs were used than with what RDBMSs can effectively address
    • 19. RDBMS in the Big-Picture Framework
        – Conceptual
            • Resources: resources and links
            • Relations: entities, attributes, relationships, and identifiers
        – Logical
            • Resources: model: hypertext; language: XQuery
            • Relations: model: extended relational; language: SQL
        – Physical (one cell spanning both Resources and Relations)
            • Indexing (e.g., scalar data types, XML, full-text), locking and isolation levels, federation, replication, in-memory databases, columnar storage, table spaces, caching, and more
    • 20. RDBMS Reconsidered
        • A Forrester big data reality check (from “Stay Alert To Database Technology Innovation,” 11/19/2010):
            – “For 90% of BI use cases, which are often less than 50 terabytes in size, relational databases still are good enough” (p. 4)
            – “Traditional relational databases are still good enough for the majority of transactional use cases” (p. 5)
    • 21. Back to the Bigger Picture
        • Compared with traditional enterprise data management, big data is
            – Essentially a collection of specialized physical models for very large, analysis-oriented data management
            – Expanding to encompass resources as well as relations
            – More about the potential for displacing expensive and closed/proprietary distributed processing alternatives than displacing RDBMS or XDBMS
    • 22. Structured Big Data: Recap
        • Substantive, sustainable, and synergistic
            – RDBMS
            – XDBMS
            – Hadoop
            – The cloud as an information management platform
        • Vaguely defined, transitory, and over-hyped
            – NoSQL
    • 23. Agenda
        • Big data in context
        • Big structured data
        • Big unstructured data
        • Big opportunities and risks
        • Q&A
    • 24. Big Unstructured Data
        • Finding Facts about Data – IDC/EMC
        • Patterns for Unstructured Big Data
        • How-to issues – who will know?
    • 25. http://www.emc.com/leadership/programs/digital-universe.htm
    • 26. 26
    • 27. 27
    • 28. 4/28/2011 28
    • 29. 29
    • 30. 30
    • 31. 4/28/2011 31
    • 32. 32
    • 33. 33
    • 34. Facebook:
        800M users
        500M visitors/day
        $100B potential value @ IPO
    • 35. http://inmaps.linkedinlabs.com/
    • 36. Unstructured Big Data Patterns
        • Search
        • Social
        • Mobile
        • Online Activities/Digital Marketing
        • Inquiry/Detection – Connecting Dots
        • Question Answering
    • 37. Mobile Adds:
        Location data points
        Voice searches
        Siri questions
        App history profile
        Browse history profile
        Search history profile
        Past purchase profile
        Camera-generated outputs/inputs
        Coupon delivery & merchandising
        Friends’ locations
        Social search
        Local ad-match algo opportunities
    • 38. 4/28/2011 38
    • 39. Online Activities/Digital Marketing
    • 40. Inquiry/Detection – Connecting Dots
        – Intelligence
        – Law Enforcement
        – Fraud Detection (Government, Financial, Health, …)
        – eDiscovery
    • 41. Social Media Monitoring
    • 42. Question Answering
    • 43. Question Answering Beyond Jeopardy
    • 44. Twitter Analytics Questions
        • What can we tell about a user from their tweets?
            – from the tweets of those they follow?
            – from the tweets of their followers?
            – from the ratio of followers/following
        • What graph structures lead to successful networks?
        • User reputation?
        • Sentiment analysis?
        • What features get a tweet retweeted?
            – How deep is the retweet tree?
        • Long term duplicate detection
        • Machine learning
        • Language detection
    • 45. 45
    • 46. http://www.mckinsey.com/en/Features/Big_Data.aspx
    • 47. Agenda
        • Big data in context
        • Big structured data
        • Big unstructured data
        • Big opportunities and risks
        • Q&A
    • 48. Big Data Opportunities
        • Improved visibility and insights
            – Can explore previously impractical questions
        • Real-time analytics
            – Less dependence on “dead data”
        • Blur the boundaries between structured and unstructured information
            – Unified views of resources and relations
        • Consolidation
            – Reduce the number of moving parts in your infrastructure
                • Along with related licensing and maintenance expenses
        • Compliance – capture and maintain data & records previously beyond firms’ capabilities
    • 49. Big Data Risks
        • The potential for an ever-expanding set of information silos
            – Critical to relentlessly focus on minimized redundancy and optimized integration
        • GIGO (garbage in, garbage out) at super-scale
            – Dramatic improvements in capabilities and price/performance provide new opportunities for self-inflicted damage, for organizations that don’t model or query effectively
        • Cognitive overreach
            – The potential for information workers to create nonsensical queries based on poorly-designed and/or misunderstood information models
        • Skills gaps create competitive disadvantages
    • 50. Q&A
        Peter O’Kelly - peter@okellyassociates.com
        Kathleen Reidy - kathleen.reidy@451Research.com
        Hadley Reynolds - hadley.reynolds@nexteraresearch.com
    • 51. Database market landscape Relational Analytic Mapr Infobright Netezza ParAccel SAP Sybase IQ Non-relational Piccolo Hadoop Teradata EMC IBM InfoSphere Dryad Brisk Greenplum Hadapt Aster Data Calpont VectorWise HP Vertica Operational Progress Oracle IBM DB2 SQL Server JustOne InterSystems MarkLogic MySQL Ingres PostgreSQL Objectivity Document Lotus Notes McObject SAP Sybase ASE EnterpriseDB Versant NoSQL CouchDB NewSQL HandlerSocket Akiban Key value MongoDB -as-a-Service MySQL Cluster Amazon RDS Couchbase RavenDB Cloudant App Engine SQL Azure Clustrix Riak Datastore Database.com Redis Drizzle Big tables Xeround FathomDB GenieDB Membrain SimpleDB ScalArc Cassandra Voldemort Hypertable Graph Schooner MySQL CodeFutures InfiniteGraph Tokutek ScaleBase NimbusDB BerkeleyDB HBase Neo4J Continuent GraphDB Translattice VoltDB Data Grid/Cache Terracotta GigaSpaces Oracle Coherence Memcached IBM eXtreme Scale GridGain ScaleOut VMware GemFire InfiniSpan CloudTran
    • 52. Big Data Complexity Continuum (IDC 2005)
        Vertical axis: Number & Complexity of Technologies
        Horizontal axis: Time Horizon: Historic, Current (Monitor), Future (Predict)
        Labels: Climate Modeling Gov’t Intelligence And Prediction Applications Predictions Trend Analytics Medical diagnostics Fraud Detection Influence Voice of Customer Networks Sentiment extraction Relationship Ad Targeting Reputation Retargeting Detection management Brand monitoring Intelligent Web search Machines Pattern Log Analysis Data mining eCommerce Detection Speech to text
    • 53. Big Data Characteristics
        Big Data: Velocity, Value, Variety/Complexity, Volume
        © IDC 12/2/2011
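Slide 15 describes Hadoop as distributed processing in the MapReduce style and lists log and clickstream analysis among its application domains. The sketch below illustrates the map, shuffle, and reduce steps in a single Python process. It is not Hadoop's API: the function names and the two-field log format ("user url") are assumptions made for illustration, and a real cluster would spread these steps across many machines.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit (url, 1) for each well-formed log line.

    Assumes a hypothetical log format of two whitespace-separated
    fields, 'user url'; malformed lines are skipped.
    """
    parts = line.split()
    if len(parts) == 2:
        yield parts[1], 1


def reducer(key, values):
    """Reduce step: total the counts collected for one key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: gather values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    logs = ["alice /home", "bob /home", "alice /pricing", "malformed"]
    print(run_mapreduce(logs, mapper, reducer))
```

Because the mapper handles each record independently and the reducer handles each key independently, both steps can run in parallel on separate nodes, which is the property that makes the pattern suit networks of commodity hardware.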