Open Source SW @ IBM Big Data
Boulder Java User Group
06/11/13
Ivan Portilla, ivanp@firstname.lastname@example.org
Ryan DeJana, rdejana@us.ibm.com

Disclaimer
• This presentation represents the view of the authors and does not represent the view of IBM.
• All opinions expressed in this presentation are strictly of the speakers, and do NOT represent those of IBM, IBM management, or anyone else.
• IBM and IBM (logo) are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.
• Many thanks to Rafael Coss & Paul Zikopoulos for the materials used in this presentation.

Agenda
• Big Data
• OSS in IBM Big Data platform
• Demo

Big Data Includes Any of the Following Characteristics
Extracting insight in context, beyond what was previously possible.
• Variety: Manage the complexity of multiple relational and non-relational data types and schemas
• Velocity: Streaming data and large volume data movement
• Volume: Scale from terabytes to zettabytes

Big Data Has New Opportunities But Needs New Analytics
[Figure: Data Scale (Kilo, Mega, Giga, Tera, Peta, Exa) against Decision Frequency (yr, mo, wk, day, hr, min, sec, …, ms, µs; Occasional, Frequent, Real-time). Regions: Traditional Data Warehouse and Business Intelligence; Data at Rest; Data in Motion. Annotations: up to 10,000 times larger; up to 10,000 times faster.]
• Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270TB for Deep Analytics
• DeepQA: 100s GB for Deep Analytics; 3 sec/decision
• Smart Traffic: 250K GPS probes/sec; 630K segments/sec; 2 ms/decision, 4K vehicles
• Homeland Security: 600,000 records/sec, 50B/day; 1-2 ms/decision; 320TB for Deep Analytics

Applications for Big Data Analytics
• Homeland Security
• Finance
• Smarter Healthcare
• Multi-channel sales
• Telecom
• Manufacturing
• Traffic Control
• Trading Analytics
• Fraud and Risk
• Log Analysis
• Search Quality
• Retail: Churn, NBO

Most requested use cases of Big Data
Utilities
§ Weather impact analysis on power generation
§ Transmission monitoring
§ Smart grid management
Retail
§ 360° View of the Customer
§ Click-stream analysis
§ Real-time promotions
Law Enforcement
§ Real-time multimodal surveillance
§ Situational awareness
§ Cyber security detection
Transportation
§ Weather and traffic impact on logistics and fuel consumption
§ Traffic congestion
Financial Services
§ Fraud detection
§ Risk management
§ 360° View of the Customer
IT
§ System log analysis
§ Cybersecurity
Telecommunications
§ CDR processing
§ Churn prediction
§ Geomapping / marketing
§ Network monitoring
Health & Life Sciences
§ Epidemic early warning
§ ICU monitoring
§ Remote healthcare monitoring
Follow this link for details on Industry Big Data use cases

§ Public wind data is available on 284 km x 284 km grids (2.5° LAT/LONG)
§ More data means more accurate and richer models (adding hundreds of variables)
  - Vestas wind library at 2.5 PB: to grow to over 6 PB in the near-term
  - Granularity 27 km x 27 km grids: driving to 9x9, 3x3 to 10 m x 10 m simulations
§ Reduced turbine placement identification from weeks to hours
§ Perspective: the Vestas wind library, as HD TV, would take 70 years to watch

Big Data Analytics in Smarter Hospitals
IBM Data Baby (youtube.com)
Big Data enabled doctors from University of Ontario to apply neonatal infant monitoring to predict infection in ICU 24 hours in advance
http://www.youtube.com/watch?v=0lt0hTNtjrY&feature=results_main&playnext=1&list=PL783389D2F81FFAB5

IBM Watson is a breakthrough in analytic innovation, but it is only successful because of the quality of the information from which it is working.

Big Data and Watson
Watson can consume insights from Big Data for advanced analysis
[Diagram labels: POS Data, CRM Data, Social Media; InfoSphere BigInsights; Distilled Insight (spending habits, social relationships, buying trends); Advanced search and analysis; Watson’s Memory]
• Big Data technology is used to build Watson’s knowledge base
• Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory.
• Approx. 200M pages of text (to compete on Jeopardy!)

IBM is committed to Open Source
► Decade of lineage and contributions to the open source community
– Apache Hadoop and Jaql, Apache Derby, Apache Geronimo, Apache Jakarta, +++
– Eclipse: founded by IBM
– Significant Lucene contributions via IBM Lucene Extension Library (ILEL)
– DRDA, XQuery, SQL, XML4J, XERCES, HTTP, Java, Linux, +++
► IBM products built on open source
– WebSphere: Apache
– Rational: Eclipse and Apache
– InfoSphere: Eclipse and Apache, +++
► IBM’s BigInsights (Hadoop) is 100% open source compatible with no forks

Introducing MapReduce
► In 2003 and 2004, Google released two papers that provide insight into their success
– The Google File System
– MapReduce: Simplified Data Processing on Large Clusters
► These papers introduced an approach to large-scale data processing known as MapReduce
Global TLE Framework

MapReduce
► A programming model
– Inspired by functional programming
– Allows expressing distributed computations on large amounts of data
► Execution framework
– Designed for large-scale data processing
– Designed to run on clusters of commodity hardware

MapReduce, the programming model
► Process key-value records
► Map function: (K_in, V_in) → list(K_inter, V_inter)
► Barrier between map and reduce phases
– Shuffle and sort phase moves and groups like keys
► Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)

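The three phases can be sketched in a single Python process. This is only an illustration of the programming model, not how Hadoop implements it: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are made up for this example, and the task (building an inverted index from document ids to words) was chosen just to show keys and values changing type between phases.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(doc_id, text):
    # (K_in, V_in) -> list(K_inter, V_inter):
    # emit one (word, doc_id) pair per distinct word in the document.
    return [(word, doc_id) for word in set(text.lower().split())]


def reduce_fn(word, doc_ids):
    # (K_inter, list(V_inter)) -> list(K_out, V_out):
    # emit the sorted list of documents that contain the word.
    return [(word, sorted(doc_ids))]


def run_mapreduce(records):
    # Map phase: every input record is processed independently.
    intermediate = [pair for key, value in records for pair in map_fn(key, value)]

    # Barrier: all map output exists before any reduce call starts.
    # Shuffle and sort: bring pairs with the same key next to each other.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: one call per distinct intermediate key.
    output = []
    for word, pairs in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(word, [doc_id for _, doc_id in pairs]))
    return output


if __name__ == "__main__":
    for word, doc_ids in run_mapreduce([(1, "big data"), (2, "Big insights")]):
        print(word, doc_ids)
```

In a real cluster the sort and grouping happen across machines; here a single in-memory sort stands in for that step.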
Pseudocode for word-count

    def mapper(line):
        # Map: emit (word, 1) for every word in the line
        for word in line.split():
            yield word, 1

    def reducer(key, values):
        # Reduce: add up all the counts collected for one word
        yield key, sum(values)

Same code can be applied to thousands of lines, even the whole web!
Google processes over 20 PB a day, much of it in MapReduce programs.

But what about the data!
[Diagram labels: Compute Nodes, NAS, SAN]

Distributed file system enables processing to be moved to the data!
[Diagram: each node holds its own key-value pairs: (key1, value1), (key2, value2), …]
Processing is done local to the data
Key-value pairs are processed independently and in parallel!

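The idea of sending the computation to each block of data, rather than pulling all the data to one place, can be imitated on one machine. In the sketch below, each inner list plays the role of a block stored on a separate node and a thread pool stands in for the cluster; both are assumptions made for illustration, and the function names are hypothetical. Only the small per-block counts travel back to be merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words_locally(block):
    # Runs "next to" one block of lines; the raw lines never leave this function.
    counts = Counter()
    for line in block:
        counts.update(line.split())
    return counts


def count_words(blocks, workers=4):
    # Each block is processed independently and in parallel,
    # then the partial counts are merged into one total.
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words_locally, blocks):
            total.update(partial)
    return total


if __name__ == "__main__":
    blocks = [["big data", "big"], ["data in motion"]]
    print(count_words(blocks).most_common())
```

Because no block depends on another, adding blocks and workers scales the map side without changing the code.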
Hadoop – A M/R Framework
► Apache open source software framework for reliable, scalable, distributed computing of massive amounts of data
§ Hides underlying system details and complexities from user
§ Developed in Java
► Core sub projects:
− MapReduce
− Hadoop Distributed File System a.k.a. HDFS
− Hadoop Common
► Supported by several Hadoop-related projects
§ HBase
§ Zookeeper
§ Avro
§ Etc.
► Meant for heterogeneous commodity hardware

Hadoop Open Source Projects
► Hadoop is supplemented by an ecosystem of open source projects
[Project logos shown include Jaql and Oozie]

The IBM Big Data Platform
Categories: MPP Data Warehouse, Stream Computing, Information Integration, Hadoop
• InfoSphere BigInsights (Data-At-Rest): Hadoop-based low latency analytics for variety and volume
• Netezza High Capacity Appliance: Queryable Archive for Structured Data
• Netezza 1000: BI + Ad Hoc Analytics on Structured Data
• Smart Analytics System: Operational Analytics on Structured Data
• Informix Timeseries: Time-structured analytics
• InfoSphere Warehouse: Large volume structured data analytics
• InfoSphere Streams (Data-In-Motion; Velocity, Variety & Volume): Low Latency Analytics for streaming data
• InfoSphere Information Server: High volume data integration and transformation
Apache Hadoop: open source framework for the distributed processing of large data sets across clusters of computers using a simple programming model

The IBM Big Data Platform
• Integrate and manage the full variety, velocity and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad-hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and Governance

BigInsights Brings Hadoop to the Enterprise
► BigInsights = analytical platform for persistent Big Data
– Based on open source & IBM technologies
– Managed like a start-up . . . . Emphasis on deep customer engagements, product plan flexibility
► Distinguishing characteristics
– Built-in analytics . . . . Enhances business knowledge
– Enterprise software integration . . . . Complements and extends existing capabilities
– Production-ready platform with tooling for analysts, developers, and administrators . . . . Speeds time-to-value; simplifies development and maintenance
► IBM advantage
– Combination of software, hardware, services and advanced research
[Diagram label: Hadoop System]

InfoSphere BigInsights
Platform for volume, variety, velocity
► Enhanced Hadoop foundation
Analytics
► Text analytics & tooling
► Application accelerators
Usability
► Web console
► Spreadsheet-style tool
► Ready-made “apps”
Enterprise Class
► Storage, security, cluster management
Integration
► Connectivity to Netezza, DB2, JDBC databases, etc.
[Diagram: Apache Hadoop, Basic Edition, Enterprise Edition, arranged by breadth of capabilities and enterprise class]
Licensed: Application accelerators, Pre-built applications, Text analytics, Spreadsheet-style tool, RDBMS and warehouse connectivity, Administrative tools, security, Eclipse development tools, Performance enhancements, . . . .
Free download, Integrated install, Online InfoCenter, BigData Univ.

BigInsights Basic Edition
Connectivity and integration: JDBC, Flume
Infrastructure: Jaql, Hive, Pig, HBase, MapReduce, HDFS, ZooKeeper, Lucene, Oozie
Also shown: Integrated installer, Sqoop, HCatalog
Legend: Open Source, IBM

Open Source Components Across Distributions
Component | BigInsights 2.0 | HortonWorks HDP 1.2 | MapR 2.0 | Greenplum HD 1.2 | Cloudera CDH3u5 | Cloudera CDH4*
Hadoop | 1.0.3 | 1.1.2 | 0.20.2 | 1.0.3 | 0.20.2 | 2.0.0*
HBase | 0.94.0 | 0.94.2 | 0.92.1 | 0.92.1 | 0.90.6 | 0.92.1
Hive | 0.9.0 | 0.10.0 | 0.9.0 | 0.8.1 | 0.7.1 | 0.8.1
Pig | 0.10.1 | 0.10.1 | 0.10.0 | 0.9.2 | 0.8.1 | 0.9.2
Zookeeper | 3.4.3 | 3.4.5 | X | 3.3.5 | 3.3.5 | 3.4.3
Oozie | 3.2.0 | 3.2.0 | 3.1.0 | X | 2.3.2 | 3.1.3
Avro | 1.6.3 | X | X | X | X | X
Flume | 0.9.4 | 1.3.0 | 1.2.0 | X | 0.9.4 | 1.1.0
Sqoop | 1.4.1 | 1.4.2 | 1.4.1 | X | 1.3.0 | 1.4.1
HCatalog | 0.4.0 | 0.5.0 | 0.4.0 | X | X | X
BigInsights continues to offer the most proven, stable versions of Apache Hadoop components
*Cloudera CDH4 Hadoop 2.0 includes Map Reduce 2.0, which Cloudera describes as “not yet considered stable”

BigInsights Content (cont’d)
Function | Basic Edition | Enterprise Edition
Integration with R (Jaql module to invoke R statistical capabilities from BigInsights) | n/a | Inc
Integration with Netezza, DB2 LUW with DPF from Jaql | n/a | Inc
LDAP authentication, Guardium support, etc. | n/a | Inc
Integrated Web Console | n/a | Inc
Business process accelerators (social data, machine data analytics) | n/a | Inc
Platform performance enhancements (Adaptive MapReduce, large scale indexing, efficient processing of compressed text files, flexible job scheduler, etc.) | n/a | Inc
Text analytics | n/a | Inc
Eclipse tools for text analytic development, Jaql, Hive, Java | n/a | Inc
Applications for data import/export, Web crawl, machine learning, etc. | n/a | Inc
Web-based application catalog | n/a | Inc
Spreadsheet-like analytical tool | n/a | Inc
IBM support | Opt | Inc
Streams, Data Explorer, Cognos BI (limited use licenses) | n/a | Inc
Unlimited storage | n/a | Inc

BigInsights: Value Beyond Open Source
[Diagram labels: Enterprise Capabilities; Administration & Security; Workload Optimization; Connectors; Open source components; Advanced Engines; Visualization & Exploration; Development Tools; IBM-certified Apache Hadoop or …]
Key differentiators
• Built-in analytics
  • Text engine, annotators, Eclipse tooling
  • Interface to project R (statistical platform)
• Enterprise software integration
• Spreadsheet-style analysis
• Integrated installation of supported open source and other components
• Web Console for admin and application access
• Platform enrichment: additional security, performance features, . . .
• World-class support
• Full open source compatibility
Business benefits
• Quicker time-to-value due to IBM technology and support
• Reduced operational risk
• Enhanced business knowledge with flexible analytical platform
• Leverages and complements existing software

Big Data Application Ecosystem
[Diagram labels: Eclipse; App library; MapReduce, …; Text Analytics; Query; App Development; Publish; Big Data Web Console]
App Development
• Code application program, and generate associated App
• Deploy Apps to Enterprise Manager
Data integration scenario: Pre-defined work flows simplify loading data from various sources
• Work flows can be configured, deployed, executed and scheduled
Development tooling:
• Text analytics
• MapReduce
• Query languages
• . . .
Application scenarios (web log, email, social media, …):
• Samples provide starting point, speed time to value

Web Console
• Manage BigInsights
  – Inspect / monitor system health
  – Add / drop nodes
  – Start / stop services
  – Run / monitor jobs (applications)
  – Explore / modify file system
  – Create custom dashboards
  – . . .
• Launch applications
  – Spreadsheet-like analysis tool
  – Pre-built applications (IBM supplied or user developed)
• Publish applications
• Monitor cluster, applications, data, etc.

Running Applications from the Web Console
• Import & Export Data
• Database & Files
• Web and Social
• Analyze and Query
• Predictive Analytics
• Text Analytics
• SQL/Hive, Jaql, Pig, HBase

Spreadsheet-style Analysis
• Web-based analysis and visualization
• Spreadsheet-like interface
  – Define and manage long running data collection jobs
  – Analyze content of the text on the pages that have been retrieved

Get started with BigInsights
• In the Cloud
  Via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise Cloud, or on private clouds.
  Pay only for the resources used.
• In the Classroom
  Via IBM Education
  Online at www.bigdatauniversity.com
• On Your Cluster
  Download Basic Edition from ibm.com.
• With the BigInsights Community
  – Technical portal @ http://tinyurl.com/biginsights
  – BigData on DW @ http://ibm.co/bigdatadev
  Links to demos, papers, forum, downloads, etc.
• Stay connected with IBM Big Data
  – http://ibmbigdatahub.com

BigDataUniversity.com
Learn Big Data Technologies
• Flexible on-line delivery allows learning @your place and @your pace
§ Free courses, free study materials.
§ Cloud-based sandbox for exercises – zero setup
§ 66666 registered students.
§ Robust Course Management System and Content Distribution infrastructure

BigInsights and Text Analytics
• Distills structured info from unstructured text
  – Sentiment analysis
  – Consumer behavior
  – Illegal or suspicious activities
  – …
• Parses text and detects meaning with annotators
• Understands the context in which the text is analyzed
• Features pre-built extractors for names, addresses, phone numbers, etc.
• Built-in support for English, Spanish, French, German, Portuguese, Dutch, Japanese, Chinese
Unstructured text (document, email, etc.):
Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.
Classification and Insight

Example Analysis: Extraction from Twitter messages
Extract intent, interests, life events and micro segmentation attributes
• Im at Mickeys Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR
• @silliesylvia good!!! U shouldnt! Think about the important stuff, like ur birthday ;) btw happy birthday Sylvia ;)
• @rakonturmiami im moving to miami in 3 months. i look foward to the new lifestyle
• I had an iphone, but its dead @JoaoVianaa. (Ive no idea where its) !Want a blackberry now !!!
Labels: Monetizable Intent; Relocation; Location; Name, Birth Day
While accounting for less relevant messages
• I think that @justinbieber deserves his 2 AMAZING songs in top ten!!! Buy them on itunes
• http://Cell-Pones.com Looking to buy a phone? WiFi Cell Phones, Windows Mobile
• @purplepleather Gotta do more research my Versace term paper 2day. Before I die, I want a versace purple diamond tiara. Im just sayin>lol
• had so much fun today! I want to buy a million dollar house with a wrap around porch ... ... wading river on the long island sound, ha i wish!
Labels: Subtle Spam, Advertising; Sarcasm, Wishful Thinking

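To give a feel for what rule-based extraction from short messages involves, here is a toy annotator built on Python regular expressions. It is not the BigInsights text analytics engine or its rule language; the `RULES` table and the `annotate` function are hypothetical, and the patterns were written by hand against the sample messages on this slide.

```python
import re

# Hypothetical hand-written rules: (label, pattern with a named "value" group).
RULES = [
    ("Location", re.compile(r"\bi'?m at (?P<value>[^(]+)", re.IGNORECASE)),
    ("Birth Day", re.compile(r"\bhappy birthday (?P<value>\w+)", re.IGNORECASE)),
    ("Relocation", re.compile(r"\bmoving to (?P<value>\w+)", re.IGNORECASE)),
    ("Monetizable Intent", re.compile(r"\bwant an? (?P<value>\w+)", re.IGNORECASE)),
]


def annotate(message):
    # Return a (label, extracted text) pair for every rule that fires.
    found = []
    for label, pattern in RULES:
        match = pattern.search(message)
        if match:
            found.append((label, match.group("value").strip()))
    return found


if __name__ == "__main__":
    samples = [
        "Im at Mickeys Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others",
        "btw happy birthday Sylvia ;)",
        "im moving to miami in 3 months. i look foward to the new lifestyle",
        "I had an iphone, but its dead. !Want a blackberry now !!!",
    ]
    for text in samples:
        print(annotate(text), "<-", text)
```

The same rules also fire on "Before I die, I want a versace purple diamond tiara", which the slide files under wishful thinking. That false positive is the reason context-aware annotators are needed rather than bare keyword patterns.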