Presentation by Ivan Portilla and Ryan Dejana at Boulder Java User's Group, June 11, 2013. See http://boulderjug.org for more details.


Transcript

  • 1. Open Source SW @ IBM Big Data. Boulder Java User Group, 06/11/13. Ivan Portilla (ivanp@us.ibm.com, portilla@gmail.com) and Ryan DeJana (rdejana@us.ibm.com)
  • 2. Disclaimer
      – This presentation represents the view of the authors and does not represent the view of IBM.
      – All opinions expressed in this presentation are strictly those of the speakers, and do NOT represent those of IBM, IBM management, or anyone else.
      – IBM and IBM (logo) are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.
      – Many thanks to Rafael Coss & Paul Zikopoulos for the materials used in this presentation.
  • 3. Agenda
      – Big Data
      – OSS in IBM Big Data platform
      – Demo
  • 4. 4
  • 5. 5
  • 6. Big Data Size Equivalence
      Name | Value | RAMAC | iPod
      1 Giga (GB) | 10^9 | 200 | -
      1 Tera (TB) | 10^12 | 200K | 200
      1 Peta (PB) | 10^15 | 200M | 200K
      1 Exa (EB) | 10^18 | 200B | 200M
      1 Zetta (ZB) | 10^21 | 200T | 200B
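The RAMAC and iPod columns are ratios, so the reference capacities can be backed out of the table itself: 200 units per gigabyte implies roughly 5 MB per RAMAC, and 200 units per terabyte implies roughly 5 GB per iPod. The sketch below is a minimal Python check of that arithmetic. It assumes decimal (power-of-ten) units as in the Value column, and the two capacities are inferred from the table rather than stated on the slide.

```python
# Reference capacities inferred from the slide's ratios (an assumption,
# not a figure stated on the slide).
RAMAC_BYTES = 10**9 // 200     # 1 GB / 200 -> 5,000,000 bytes (about 5 MB)
IPOD_BYTES = 10**12 // 200     # 1 TB / 200 -> 5,000,000,000 bytes (about 5 GB)

# Decimal unit sizes, matching the "Value" column of the table.
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def equivalents(unit):
    """Return (RAMACs, iPods) needed to hold one `unit` of data."""
    size = UNITS[unit]
    return size / RAMAC_BYTES, size / IPOD_BYTES


for name in UNITS:
    ramacs, ipods = equivalents(name)
    print(f"1 {name} = {ramacs:,.0f} RAMACs = {ipods:,.1f} iPods")
```

Under these assumptions one gigabyte is only a fifth of an iPod, which would explain why that cell is empty on the slide.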
  • 7. Why Didn’t We Use All of the Big Data Before?
  • 8. Big Data Includes Any of the Following Characteristics. Extracting insight in context, beyond what was previously possible.
      – Variety: manage the complexity of multiple relational and non-relational data types and schemas
      – Velocity: streaming data and large volume data movement
      – Volume: scale from terabytes to zettabytes
  • 9. Veracity
  • 10. Big Data Has New Opportunities But Needs New Analytics
      [Chart: data scale (Kilo, Mega, Giga, Tera, Peta, Exa) against decision frequency (yr, mo, wk, day, hr, min, sec, ms, µs; occasional, frequent, real-time). Regions: Traditional Data Warehouse and Business Intelligence; Data at Rest; Data in Motion. Annotations: "Up to 10,000 times larger", "Up to 10,000 times faster".]
      – Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270 TB for deep analytics
      – DeepQA: 100s of GB for deep analytics; 3 sec/decision
      – Smart Traffic: 250K GPS probes/sec; 630K segments/sec; 2 ms/decision, 4K vehicles
      – Homeland Security: 600,000 records/sec, 50B/day; 1-2 ms/decision; 320 TB for deep analytics
  • 11. Applications for Big Data Analytics: Homeland Security; Finance; Smarter Healthcare; Multi-channel Sales; Telecom; Manufacturing; Traffic Control; Trading Analytics; Fraud and Risk; Log Analysis; Search Quality; Retail: Churn, NBO
  • 12. Most requested use cases of Big Data
      – Utilities: weather impact analysis on power generation; transmission monitoring; smart grid management
      – Retail: 360° view of the customer; click-stream analysis; real-time promotions
      – Law Enforcement: real-time multimodal surveillance; situational awareness; cyber security detection
      – Transportation: weather and traffic impact on logistics and fuel consumption; traffic congestion
      – Financial Services: fraud detection; risk management; 360° view of the customer
      – IT: system log analysis; cybersecurity
      – Telecommunications: CDR processing; churn prediction; geomapping / marketing; network monitoring
      – Health & Life Sciences: epidemic early warning; ICU monitoring; remote healthcare monitoring
      Follow this link for details on Industry Big Data use cases
  • 13.
      – Public wind data is available on 284 km x 284 km grids (2.5° LAT/LONG)
      – More data means more accurate and richer models (adding hundreds of variables)
          - Vestas wind library at 2.5 PB: to grow to over 6 PB in the near-term
          - Granularity 27 km x 27 km grids: driving to 9x9, 3x3 to 10 m x 10 m simulations
      – Reduced turbine placement identification from weeks to hours
      – Perspective: the Vestas wind library, as HD TV, would take 70 years to watch
  • 14. Big Data Analytics in Smarter Hospitals. Big Data enabled doctors from University of Ontario to apply neonatal infant monitoring to predict infection in ICU 24 hours in advance. Video: IBM Data Baby (youtube.com), http://www.youtube.com/watch?v=0lt0hTNtjrY&feature=results_main&playnext=1&list=PL783389D2F81FFAB5
  • 15. IBM Watson is a breakthrough in analytic innovation, but it is only successful because of the quality of the information from which it is working.
  • 16. Big Data and Watson
      – Watson can consume insights from Big Data for advanced analysis.
      – Big Data technology is used to build Watson’s knowledge base.
      – Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory.
      [Diagram: POS data, CRM data and social media; InfoSphere BigInsights; distilled insight (spending habits, social relationships, buying trends); advanced search and analysis; Watson’s memory: approx. 200M pages of text (to compete on Jeopardy!).]
  • 17. IBM is committed to Open Source
      ► Decade of lineage and contributions to the open source community
          – Apache Hadoop and Jaql, Apache Derby, Apache Geronimo, Apache Jakarta, +++
          – Eclipse: founded by IBM
          – Significant Lucene contributions via IBM Lucene Extension Library (ILEL)
          – DRDA, XQuery, SQL, XML4J, XERCES, HTTP, Java, Linux, +++
      ► IBM products built on open source
          – WebSphere: Apache
          – Rational: Eclipse and Apache
          – InfoSphere: Eclipse and Apache, +++
      ► IBM’s BigInsights (Hadoop) is 100% open source compatible with no forks
  • 18. Introducing MapReduce (slide footer: Global TLE Framework)
      ► In 2003 and 2004 Google released two papers that provide insight into their success
          – The Google File System
          – MapReduce: Simplified Data Processing on Large Clusters
      ► Introduced an approach to large scale data processing known as MapReduce
  • 19. MapReduce
      ► A programming model
          – Inspired by functional programming
          – Allows expressing distributed computations on large amounts of data
      ► Execution framework
          – Designed for large-scale data processing
          – Designed to run on clusters of commodity hardware
  • 20. MapReduce, the programming model
      ► Process key-value records
      ► Map function: (Kin, Vin) → list(Kinter, Vinter)
      ► Barrier between map and reduce phases
          – Shuffle and sort phase moves and groups like keys
      ► Reduce function: (Kinter, list(Vinter)) → list(Kout, Vout)
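The two signatures and the barrier between them can be sketched as a small single-process driver. This is an illustrative Python model of the programming model only, not how Hadoop implements it. The name `run_mapreduce` and the sample records are hypothetical.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process model of MapReduce (hypothetical helper, for illustration).

    mapper:  (k_in, v_in)            -> iterable of (k_inter, v_inter)
    reducer: (k_inter, [v_inter...]) -> iterable of (k_out, v_out)
    """
    # Map phase: every input record is processed independently.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Barrier: shuffle and sort moves and groups like keys together.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per distinct intermediate key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Made-up example: largest value seen per key.
records = [("r1", ("a", 3)), ("r2", ("b", 5)), ("r3", ("a", 7))]
result = run_mapreduce(
    records,
    mapper=lambda _record_id, pair: [pair],            # re-key by the pair's own key
    reducer=lambda key, values: [(key, max(values))],  # collapse each group to one value
)
print(result)
```

No reducer runs until every mapper has finished, which is the barrier the slide refers to. A real framework distributes both phases across machines.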
  • 21. Map phase, word-count example: (line1, “Hello there.”), (line2, “Why, hello.”) → (“hello”, 1), (“there”, 1), (“why”, 1), (“hello”, 1)
  • 22. Sort phase, word-count example: (“hello”, 1), (“hello”, 1), (“there”, 1), (“why”, 1)
  • 23. Reduce phase, word-count example: (“hello”, 1), (“hello”, 1), (“there”, 1), (“why”, 1) → (“hello”, 2), (“there”, 1), (“why”, 1)
  • 24. MapReduce, end to end
  • 25. Pseudocode for word-count
          def mapper(line):
              for word in line.split():
                  output(word, 1)

          def reducer(key, values):
              output(key, sum(values))
      Same code can be applied to thousands of lines, even the whole web! Google processes over 20 PB a day, much of it in MapReduce programs.
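A runnable version of the word-count pseudocode, traced through the map, sort and reduce phases on the two example lines from the earlier slides, might look like the sketch below. One detail is an assumption: a plain `line.split()` would keep "Hello" and "hello." as different keys, so this mapper lowercases and strips punctuation to reproduce the pairs shown on the slides.

```python
import re
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (word, 1) per word. Lowercasing and dropping punctuation is an assumption."""
    for word in re.findall(r"[a-z]+", line.lower()):
        yield (word, 1)


def reducer(key, values):
    """Sum the counts collected for one word."""
    yield (key, sum(values))


lines = ["Hello there.", "Why, hello."]

# Map phase: each line is processed independently.
mapped = [pair for line in lines for pair in mapper(line)]

# Sort phase: bring identical keys next to each other.
shuffled = sorted(mapped, key=itemgetter(0))

# Reduce phase: one reducer call per run of identical keys.
reduced = []
for key, group in groupby(shuffled, key=itemgetter(0)):
    reduced.extend(reducer(key, [count for _, count in group]))

print(mapped)
print(shuffled)
print(reduced)
```

`groupby` only merges adjacent items, which is why the sort has to happen before the reduce.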
  • 26. But what about the data! [Diagram: compute nodes, NAS, SAN]
  • 27. Distributed file system enables processing to be moved to the data!
      – Processing is done local to the data
      – Key-value pairs are processed independently and in parallel!
      [Diagram: each node holds its own (key1, value1), (key2, value2), … records]
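The idea of moving processing to the data can be imitated on one machine: each block is counted where it "lives", and only the small partial result is sent back to be merged. The sketch below uses Python threads as stand-ins for cluster nodes, so it shows the structure only and not real distributed execution. The block contents and the name `count_block` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for file blocks stored on different nodes.
blocks = [
    ["Hello there.", "Why, hello."],   # block held by "node 1"
    ["hello again"],                   # block held by "node 2"
]


def count_block(block):
    """Count words inside one block; only this small Counter leaves the 'node'."""
    counts = Counter()
    for line in block:
        counts.update(word.strip(".,!?").lower() for word in line.split())
    return counts


# Blocks are independent of each other, so they can be processed in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_block, blocks))

# Merge step: combine the per-block partial counts.
total = sum(partials, Counter())
print(total)
```

The raw lines never cross the "network" here; only per-block counts do. That is the property the slide contrasts with pulling everything from shared NAS or SAN storage.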
  • 28. Hadoop – A M/R Framework
      ► Apache open source software framework for reliable, scalable, distributed computing of massive amounts of data
          § Hides underlying system details and complexities from user
          § Developed in Java
      ► Core sub projects:
          − MapReduce
          − Hadoop Distributed File System a.k.a. HDFS
          − Hadoop Common
      ► Supported by several Hadoop-related projects
          § HBase
          § Zookeeper
          § Avro
          § Etc.
      ► Meant for heterogeneous commodity hardware
  • 29. Hadoop Architecture
  • 30. Who uses Hadoop?
  • 31. Hadoop Open Source Projects
      ► Hadoop is supplemented by an ecosystem of open source projects, including Jaql and Oozie
  • 32. The IBM Big Data Platform
      Data-At-Rest
          – InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume
          – Netezza High Capacity Appliance: queryable archive for structured data
          – Netezza 1000: BI + ad hoc analytics on structured data
          – Smart Analytics System: operational analytics on structured data
          – Informix Timeseries: time-structured analytics
          – InfoSphere Warehouse: large volume structured data analytics
      Data-In-Motion (velocity, variety & volume)
          – InfoSphere Streams: low latency analytics for streaming data
      InfoSphere Information Server: high volume data integration and transformation
      Categories shown: MPP Data Warehouse; Stream Computing; Information Integration; Hadoop
      Apache Hadoop: open source framework for the distributed processing of large data sets across clusters of computers using a simple programming model
  • 33. The IBM Big Data Platform
      – Integrate and manage the full variety, velocity and volume of data
      – Apply advanced analytics to information in its native form
      – Visualize all available data for ad-hoc analysis
      – Development environment for building new analytic applications
      – Workload optimization and scheduling
      – Security and Governance
  • 34. BigInsights Brings Hadoop to the Enterprise
      ► BigInsights = analytical platform for persistent Big Data
          – Based on open source & IBM technologies
          – Managed like a start-up . . . . Emphasis on deep customer engagements, product plan flexibility
      ► Distinguishing characteristics
          – Built-in analytics . . . . Enhances business knowledge
          – Enterprise software integration . . . . Complements and extends existing capabilities
          – Production-ready platform with tooling for analysts, developers, and administrators . . . . Speeds time-to-value; simplifies development and maintenance
      ► IBM advantage
          – Combination of software, hardware, services and advanced research
      [Diagram: Hadoop System]
  • 35. InfoSphere BigInsights: platform for volume, variety, velocity
      ► Enhanced Hadoop foundation
      Analytics
      ► Text analytics & tooling
      ► Application accelerators
      Usability
      ► Web console
      ► Spreadsheet-style tool
      ► Ready-made “apps”
      Enterprise Class
      ► Storage, security, cluster management
      Integration
      ► Connectivity to Netezza, DB2, JDBC databases, etc
      [Diagram: editions plotted by breadth of capabilities and enterprise class. Apache Hadoop; Basic Edition (free download, integrated install, online InfoCenter, BigData Univ.); Enterprise Edition (licensed: application accelerators, pre-built applications, text analytics, spreadsheet-style tool, RDBMS and warehouse connectivity, administrative tools and security, Eclipse development tools, performance enhancements, . . . .)]
  • 36. BigInsights Basic Edition
      [Diagram; legend: Open Source / IBM]
      – Connectivity and integration: JDBC, Flume
      – Infrastructure: Jaql, Hive, Pig, HBase, MapReduce, HDFS, ZooKeeper, Lucene, Oozie
      – Integrated installer
      – Also shown: Sqoop, HCatalog
  • 37. BigInsights Enterprise Edition
      [Diagram; legend: Open Source / IBM]
      – Connectivity and integration: Streams, Netezza, JDBC, Flume
      – Text processing engine and library
      – Infrastructure: Jaql, Hive, Pig, HBase, MapReduce, HDFS, ZooKeeper, Oozie; indexing; Lucene; Adaptive MapReduce; text compression; enhanced security; flexible scheduler
      – Analytics and discovery “Apps”: BigSheets, Web Crawler, Distrib file copy, DB export, Boardreader, DB import, Ad hoc query, Machine learning, Data processing, . . .
      – Administrative and development tools
          Web console
              • Monitor cluster health, jobs, etc.
              • Add / remove nodes
              • Start / stop services
              • Inspect job status
              • Inspect workflow status
              • Deploy applications
              • Launch apps / jobs
              • Work with distrib file system
              • Work with spreadsheet interface
              • Support REST-based API
              • . . .
          Eclipse tools
              • Text analytics
              • MapReduce programming
              • Jaql, Hive, Pig development
              • BigSheets plug-in development
              • Oozie workflow generation
      – Integrated installer
      – Optional IBM and partner offerings: IBM Cognos BI, GPFS (EAP), Accelerator for machine data analysis, Accelerator for social data analysis, Guardium, DataStage, Data Explorer
      – Also shown: DB2, R, Sqoop, HCatalog
  • 38. Open Source Components Across Distributions
      Component | BigInsights 2.0 | HortonWorks HDP 1.2 | MapR 2.0 | Greenplum HD 1.2 | Cloudera CDH3u5 | Cloudera CDH4*
      Hadoop | 1.0.3 | 1.1.2 | 0.20.2 | 1.0.3 | 0.20.2 | 2.0.0*
      HBase | 0.94.0 | 0.94.2 | 0.92.1 | 0.92.1 | 0.90.6 | 0.92.1
      Hive | 0.9.0 | 0.10.0 | 0.9.0 | 0.8.1 | 0.7.1 | 0.8.1
      Pig | 0.10.1 | 0.10.1 | 0.10.0 | 0.9.2 | 0.8.1 | 0.9.2
      Zookeeper | 3.4.3 | 3.4.5 | X | 3.3.5 | 3.3.5 | 3.4.3
      Oozie | 3.2.0 | 3.2.0 | 3.1.0 | X | 2.3.2 | 3.1.3
      Avro | 1.6.3 | X | X | X | X | X
      Flume | 0.9.4 | 1.3.0 | 1.2.0 | X | 0.9.4 | 1.1.0
      Sqoop | 1.4.1 | 1.4.2 | 1.4.1 | X | 1.3.0 | 1.4.1
      HCatalog | 0.4.0 | 0.5.0 | 0.4.0 | X | X | X
      BigInsights continues to offer the most proven, stable versions of Apache Hadoop components.
      *Cloudera CDH4 Hadoop 2.0 includes Map Reduce 2.0 which Cloudera states “not yet considered stable”
  • 39. Hadoop Systems [Diagram components: HDFS; Map/Reduce; Hive, Pig & Jaql; Sqoop; Zookeeper; Avro (Serialization); HBase; ETL Tools; BI Reporting; RDBMS]
  • 40. BigInsights Content
      Function | Version | Basic Edition | Enterprise Edition
      Integrated Install | | Inc | Inc
      Hadoop (including common utilities, HDFS, MapReduce framework) | 1.0.3 | Inc | Inc
      Jaql (programming / query language) | 0.5.2 | Inc | Inc
      Pig (programming / query language) | 0.10.0 | Inc | Inc
      Flume (data collection/aggregation) | 0.9.4 | Inc | Inc
      Hive (data summarization/querying) | 0.9.0 | Inc | Inc
      Lucene (text search)* | 3.3.0 | Inc | Inc
      Zookeeper (process coordination) | 3.4.3 | Inc | Inc
      Avro (data serialization) | 1.6.3 | Inc | Inc
      HBase (real time read/write) | 0.94.0 | Inc | Inc
      HCatalog (table and storage management service) | 0.4.0 | Inc | Inc
      Sqoop (RDBMS bulk data transfer) | 1.4.1 | Inc | Inc
      Oozie (workflow / job orchestration) | 3.2.0 | Inc | Inc
      Online documentation | | Inc | Inc
      Integration with JDBC sources through general-purpose Jaql module | | Inc | Inc
      Integration with DB2 (sample functions to submit jobs, read data) | | Inc | Inc
  • 41. BigInsights Content (cont’d)
      Function | Basic Edition | Enterprise Edition
      Integration with R (Jaql module to invoke R statistical capabilities from BigInsights) | n/a | Inc
      Integration with Netezza, DB2 LUW with DPF from Jaql | n/a | Inc
      LDAP authentication, Guardium support, etc. | n/a | Inc
      Integrated Web Console | n/a | Inc
      Business process accelerators (social data, machine data analytics) | n/a | Inc
      Platform performance enhancements (Adaptive MapReduce, large scale indexing, efficient processing of compressed text files, flexible job scheduler, etc.) | n/a | Inc
      Text analytics | n/a | Inc
      Eclipse tools for text analytic development, Jaql, Hive, Java | n/a | Inc
      Applications for data import/export, Web crawl, machine learning, etc. | n/a | Inc
      Web-based application catalog | n/a | Inc
      Spreadsheet-like analytical tool | n/a | Inc
      IBM support | Opt | Inc
      Streams, Data Explorer, Cognos BI (limited use licenses) | n/a | Inc
      Unlimited storage | n/a | Inc
  • 42. BigInsights: Value Beyond Open Source
      [Diagram: open source components (IBM-certified Apache Hadoop or …) plus enterprise capabilities: administration & security, workload optimization, connectors, advanced engines, visualization & exploration, development tools]
      Key differentiators
          – Built-in analytics
          – Text engine, annotators, Eclipse tooling
          – Interface to project R (statistical platform)
          – Enterprise software integration
          – Spreadsheet-style analysis
          – Integrated installation of supported open source and other components
          – Web Console for admin and application access
          – Platform enrichment: additional security, performance features, . . .
          – World-class support
          – Full open source compatibility
      Business benefits
          – Quicker time-to-value due to IBM technology and support
          – Reduced operational risk
          – Enhanced business knowledge with flexible analytical platform
          – Leverages and complements existing software
  • 43. BigInsights - Demo
  • 44. Big Data Application Ecosystem
      [Diagram: Eclipse; App library; MapReduce, …; Text Analytics; Query; App Development; Publish; Big Data Web Console]
      – App development: code application program, and generate associated App; deploy Apps to Enterprise Manager
      – Data integration scenario: pre-defined work flows simplify loading data from various sources; work flows can be configured, deployed, executed and scheduled
      – Development tooling: text analytics; MapReduce; query languages; . . .
      – Application scenarios (web log, email, social media, …): samples provide starting point, speed time to value
  • 45. Web Console
      • Manage BigInsights
          Inspect / monitor system health
          Add / drop nodes
          Start / stop services
          Run / monitor jobs (applications)
          Explore / modify file system
          Create custom dashboards
          . . .
      • Launch applications
          Spreadsheet-like analysis tool
          Pre-built applications (IBM supplied or user developed)
      • Publish applications
      • Monitor cluster, applications, data, etc.
  • 46. Running Applications from the Web Console
      • Import & Export Data
      • Database & Files
      • Web and Social
      • Analyze and Query
      • Predictive Analytics
      • Text Analytics
      • SQL/Hive, Jaql, Pig, HBase
  • 47. Spreadsheet-style Analysis
      • Web-based analysis and visualization
      • Spreadsheet-like interface
          Define and manage long running data collection jobs
          Analyze content of the text on the pages that have been retrieved
  • 48. Get started with BigInsights
      • In the Cloud: via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise Cloud, or on private clouds. Pay only for the resources used.
      • In the Classroom: via IBM Education; online at www.bigdatauniversity.com
      • On Your Cluster: download Basic Edition from ibm.com.
      • With the BigInsights Community
          – Technical portal @ http://tinyurl.com/biginsights
          – BigData on DW @ http://ibm.co/bigdatadev
          Links to demos, papers, forum, downloads, etc.
      • Stay connected with IBM Big Data
          – http://ibmbigdatahub.com
  • 49. BigDataUniversity.com: Learn Big Data Technologies
      • Flexible on-line delivery allows learning @your place and @your pace
          § Free courses, free study materials.
          § Cloud-based sandbox for exercises – zero setup
          § 66666 registered students.
          § Robust Course Management System and Content Distribution infrastructure
  • 50. Big Data is ripe for innovation
  • 51. Backup slides
  • 52. OSS in IBM Big Data Platform
      Hadoop - http://hadoop.apache.org/
      HDFS - http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
      Hive - http://hive.apache.org/
      HBase - http://hbase.apache.org/
      Flume - http://flume.apache.org/
      Jaql - http://code.google.com/p/jaql/wiki/Running
      Oozie - http://oozie.apache.org/
      Sqoop - http://sqoop.apache.org/
      Avro - http://avro.apache.org/
      Lucene - http://lucene.apache.org/
      Pigserver - http://pig.apache.org/
      Zookeeper - http://zookeeper.apache.org/
      Bigtop - http://bigtop.apache.org/
  • 53. Build a Big Data Program – MapReduce example. Eclipse tools for Jaql, Hive, Pig, Java MapReduce, BigSheets plug-ins, text analytics, etc.
  • 54. BigInsights Text Analytics Development
  • 55. BigInsights and Text Analytics
      • Distills structured info from unstructured text
          Sentiment analysis
          Consumer behavior
          Illegal or suspicious activities
          …
      • Parses text and detects meaning with annotators
      • Understands the context in which the text is analyzed
      • Features pre-built extractors for names, addresses, phone numbers, etc.
      • Built-in support for English, Spanish, French, German, Portuguese, Dutch, Japanese, Chinese
      Unstructured text (document, email, etc): “Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.”
      Classification and Insight
  • 56. Example Analysis: Extraction from Twitter messages
      Extract intent, interests, life events and micro segmentation attributes
          – Im at Mickeys Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR
          – @silliesylvia good!!! U shouldnt! Think about the important stuff, like ur birthday ;) btw happy birthday Sylvia ;)
          – @rakonturmiami im moving to miami in 3 months. i look foward to the new lifestyle
          – I had an iphone, but its dead @JoaoVianaa. (Ive no idea where its) !Want a blackberry now !!!
      While accounting for less relevant messages
          – I think that @justinbieber deserves his 2 AMAZING songs in top ten!!! Buy them on itunes
          – http://Cell-Pones.com Looking to buy a phone? WiFi Cell Phones, Windows Mobile
          – @purplepleather Gotta do more research my Versace term paper 2day. Before I die, I want a versace purple diamond tiara. Im just sayin>lol
          – had so much fun today! I want to buy a million dollar house with a wrap around porch ... ... wading river on the long island sound, ha i wish!
      Labels shown: Monetizable Intent; Relocation; Location; Name, Birth Day; Subtle Spam, Advertising; Sarcasm, Wishful Thinking
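To make the extraction task concrete, here is a deliberately naive rule-based tagger in Python. It is not the BigInsights text analytics engine. The label names come from the slide, but the regular expressions and the pairing of labels with tweets are assumptions made for illustration.

```python
import re

# Hypothetical keyword rules: labels come from the slide, patterns are illustrative only.
RULES = {
    "Location": re.compile(r"\bI'?m at\b", re.IGNORECASE),
    "Name, Birth Day": re.compile(r"\bhappy birthday\b", re.IGNORECASE),
    "Relocation": re.compile(r"\bmoving to\b", re.IGNORECASE),
    "Monetizable Intent": re.compile(r"\bwant a\b", re.IGNORECASE),
}
MENTION = re.compile(r"@\w+")


def annotate(tweet):
    """Return the labels whose rule fires on the tweet, plus any @mentions."""
    return {
        "labels": [label for label, pattern in RULES.items() if pattern.search(tweet)],
        "mentions": MENTION.findall(tweet),
    }


relocation = annotate(
    "@rakonturmiami im moving to miami in 3 months. i look foward to the new lifestyle"
)
intent = annotate(
    "I had an iphone, but its dead @JoaoVianaa. (Ive no idea where its) !Want a blackberry now !!!"
)
wishful = annotate(
    "@purplepleather Gotta do more research my Versace term paper 2day. "
    "Before I die, I want a versace purple diamond tiara. Im just sayin>lol"
)
print(relocation, intent, wishful, sep="\n")
```

The third tweet is one of the slide's "less relevant" messages, yet the keyword rule tags it as monetizable intent just like the second. That false positive is the reason context-aware analysis is needed on top of simple pattern matching.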