
Getting started with Hadoop on the Cloud with Bluemix


Published on

Silicon Valley Code Camp -- October 11, 2014.
Session: Getting started with Hadoop on the Cloud.

Hadoop and the Cloud are an almost perfect marriage. Hadoop is a distributed computing framework that runs on a cluster built from commodity hardware. The Cloud simplifies provisioning of machines and software. Starting with Hadoop on the Cloud lets you provision your environment quickly and begin using Hadoop right away. IBM Bluemix has democratized Hadoop for the masses! This session provides a brief introduction to what Hadoop is and how the cloud works, then focuses on how to get started through a series of demos. We will conclude with a discussion of the tutorials and public datasets: all of the tools needed to get you started quickly.

Learn more about BigInsights for Hadoop:

Published in: Software


  1. 1. October 11, 2014 Getting started with Hadoop on the Cloud Nicolas Morales – Solutions Engineer – @NicolasJMorales © 1 2014 IBM Corporation
  2. 2. Welcome Goal: Get you started with Hadoop on the Cloud Hadoop − What technical problem is it helping solve? BIG DATA − What is Hadoop? − BigInsights (IBM’s Hadoop distro) Bluemix (IBM’s PaaS cloud solution) − What technical problem is it helping solve? − Analytics for Hadoop in the Cloud Demo Get hands-on − Bluemix: − Hadoop Dev: © 2 2014 IBM Corporation
  3. 3. It starts with a line of code. © 3 2014 IBM Corporation
  7. 7. What is Big Data? A way to describe data problems that are unsolvable using traditional tools More Analytics on More Data for More People © 7 2014 IBM Corporation
  8. 8. What Data? Transactional and Application Data. Machine Data. Social Data. Enterprise Content. More Analytics on More Data for More People
  10. 10. In 2005 there were 1.3 billion RFID tags in circulation around the world; by the end of 2011, this was about 30 billion and growing even faster.
  11. 11. An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics… 1 BILLION lines of code. EACH engine generating 10 TB every 30 minutes!
  12. 12. Welcome to the Instrumented Interconnected World! 12+ TBs of tweet data every day. 25+ TBs of log data every day. ? TBs of data every day. 4.6 billion camera phones worldwide. 100s of millions of GPS-enabled devices sold annually. 2+ billion people on the Web by end 2011. 30 billion RFID tags today (1.3B in 2005). 76 million smart meters in 2009… 200M by 2014.
  13. 13. 6,000,000 users on Twitter pushing out 300,000 tweets per day. 500,000,000 users on Twitter pushing out 400,000,000 tweets per day. Users: 83x. Tweets per day: 1333x.
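The 83x and 1333x multipliers on this slide follow from its four figures. A quick check, assuming the smaller pair of numbers is the earlier snapshot and that the multipliers were truncated rather than rounded:

```python
# Figures as quoted on the slide (earlier snapshot vs. later snapshot).
users_before, users_after = 6_000_000, 500_000_000
tweets_before, tweets_after = 300_000, 400_000_000


def growth_factor(before: int, after: int) -> int:
    """Return how many times larger `after` is than `before`, truncated."""
    return after // before


user_growth = growth_factor(users_before, users_after)
tweet_growth = growth_factor(tweets_before, tweets_after)
print(f"users grew {user_growth}x, tweets per day grew {tweet_growth}x")
```

Tweet volume grew far faster than the user base, which is the point the slide makes about data volume outpacing adoption.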
  14. 14. We’ve Moved into a New Era of Computing. Volume: 12+ terabytes of Tweets created daily. Velocity: 5+ million trade events per second. Variety: 100’s of different types of data. Veracity: Only 1 in 3 decision makers trust their information.
  15. 15. Imagine the Possibilities of Harnessing Your Data Resources Big data challenges exist in every organization today Government cuts acoustic analysis from hours to 70 Milliseconds Retailer reduces time to run queries by 80% to optimize inventory Utility avoids power failures by analyzing 10 PB of data in minutes Stock Exchange cuts queries from 26 hours to 2 minutes on 2 PB Hospital analyses streaming vitals to detect illness 24 hours earlier Telco analyses streaming network data to reduce hardware costs by 90% © 15 2014 IBM Corporation
  16. 16. Every Industry can Leverage Big Data and Analytics. Insurance • 360 View of Domain or Subject • Catastrophe Modeling • Fraud and Abuse • Producer Performance Analytics • Analytics Sandbox. Banking • Optimizing Offers and Cross-sell • Customer Service and Call Center Efficiency • Fraud Detection and Investigation • Credit and Counterparty Risk. Telco • Pro-active Call Center • Network Analytics • Location Based Services. Energy and Utilities • Smart Meter Analytics • Distribution Load Forecasting/Scheduling • Condition Based Maintenance • Create and Target Customer Offerings. Media and Entertainment • Business process transformation • Audience and Marketing Optimization • Multi-Channel Enablement • Digital commerce optimization. Retail • Actionable Customer Insight • Merchandise Optimization • Dynamic Pricing. Travel and Transport • Customer Analytics and Loyalty Marketing • Predictive Maintenance Analytics • Capacity and Pricing Optimization. Consumer Products • Shelf Availability • Promotional Spend Optimization • Merchandising Compliance • Promotion Exceptions and Alerts. Government • Civilian Services • Defense and Intelligence • Tax and Treasury Services. Healthcare • Measure and Act on Population Health Outcomes • Engage Consumers in their Healthcare. Automotive • Advanced Condition Monitoring • Data Warehouse Optimization • Actionable Customer Intelligence. Life Sciences • Increase visibility into drug safety and effectiveness. Chemical and Petroleum • Operational Surveillance, Analysis and Optimization • Data Warehouse Consolidation, Integration and Augmentation • Big Data Exploration for Interdisciplinary Collaboration. Aerospace and Defense • Uniform Information Access Platform • Data Warehouse Optimization • Airliner Certification Platform • Advanced Condition Monitoring (ACM). Electronics • Customer/Channel Analytics • Advanced Condition Monitoring
  17. 17. Enabling everybody to leverage Big Data (diagram labels: GPS, External Data). Business Users ... offer personalized price promotions to different customer segments in real-time. Business Development ... find and deliver new mechanisms to monetize network traffic and partner with upstream content providers. Administrators ... manage and optimize data access and analysis operations. Executive Leaders ... get real-time reports and analysis based on data inside as well as outside the enterprise (web, social media, etc.). Business Analysts ... analyze social media buzz for the new services/offerings to gauge initial success and any course correction needed. Developers ... develop new apps and detailed algorithms in response to user and business requirements. Data Scientists ... analyze subscriber usage patterns in real-time and combine them with the profile for delivering promotional or retention offers
  18. 18. Leveraging Big Data Requires Multiple Platform Capabilities. Understand and navigate federated big data sources: Federated Discovery and Navigation. Manage and store huge volume of any data: Hadoop File System, MapReduce. Structure and control data: Data Warehousing. Manage streaming data: Stream Computing. Analyze unstructured data: Text Analytics Engine. Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
  19. 19. What is Hadoop? Apache open source software framework for reliable, scalable, distributed computing on massive amounts of data. Hides underlying system details and complexities from the user. Developed in Java. Core sub-projects: − MapReduce − Hadoop Distributed File System, a.k.a. HDFS. Supported by several Hadoop-related projects: HBase, ZooKeeper, Avro, Flume, etc. Meant for heterogeneous commodity hardware
  20. 20. Design Principles of Hadoop New way of storing and processing the data: − Let system handle most of the issues automatically: • Failures • Scalability • Reduce communications • Distribute data and processing power to where the data is • Make parallelism part of operating system • Relatively inexpensive hardware Bring processing to Data! Hadoop = HDFS + MapReduce infrastructure + … Optimized to handle − Massive amounts of data through parallelism − A variety of data (structured, unstructured, semi-structured) − Using inexpensive commodity hardware Reliability provided through replication © 20 2014 IBM Corporation
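"Reliability provided through replication" can be illustrated with a toy model of how a file is cut into blocks and each block is copied to several nodes. This is only a sketch: the 128 MB block size and replication factor of 3 are assumed here as typical HDFS defaults, the node names are hypothetical, and the round-robin placement stands in for the rack-aware policy a real cluster uses.

```python
from itertools import cycle

BLOCK_SIZE_MB = 128   # assumed block size (a common HDFS default)
REPLICATION = 3       # assumed replication factor (a common HDFS default)


def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> list:
    """Return the size of each block a file of the given size is cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: list, replication: int = REPLICATION) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin.

    Real HDFS placement is rack-aware; round-robin is a simplification.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    ring = cycle(nodes)
    return {b: [next(ring) for _ in range(replication)] for b in range(num_blocks)}


def surviving_blocks(placement: dict, failed_node: str) -> list:
    """Blocks that still have at least one replica after one node fails."""
    return [b for b, holders in placement.items()
            if any(node != failed_node for node in holders)]


blocks = split_into_blocks(300)                 # a 300 MB file
nodes = ["node1", "node2", "node3", "node4"]    # hypothetical node names
placement = place_replicas(len(blocks), nodes)
readable = surviving_blocks(placement, "node1")
print(blocks, placement, readable)
```

Because every block lives on three different nodes, losing any single node leaves every block readable, which is why inexpensive hardware is acceptable.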
  21. 21. Map-Reduce Hadoop BigInsights © 21 2014 IBM Corporation
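The MapReduce model named on this slide can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases using the classic word-count example; it is not Hadoop's Java API, and the function names are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs) -> dict:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines) -> dict:
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data on the cloud", "Hadoop on the cloud"])
print(counts)
```

On a cluster, the map calls run in parallel on the nodes that hold each input block, and the framework performs the shuffle across the network before the reducers run.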
  22. 22. Hadoop Open Source Projects Hadoop is supplemented by an ecosystem of open source projects © 22 2014 IBM Corporation
  23. 23. What’s a Hadoop Distribution? What’s a Linux Distribution? − Linux Kernel − Open Source Tools around Kernel − Installer − Administration UI Open Source Distribution Formula − Kernel − Core Projects around Kernel − Value Add • Test Components • Installer • Administration UI • Apps © 23 2014 IBM Corporation
  24. 24. IBM Enriches Hadoop. Scalable − New nodes can be added on the fly. Affordable − Massively parallel computing on commodity servers. Flexible − Hadoop is schema-less, and can absorb any type of data. Fault Tolerant − Through MapReduce software framework. Enterprise Hardening of Hadoop: Performance and reliability − Adaptive MapReduce, Compression, Indexing, Flexible Scheduler, +++. Productivity Accelerators − Web-based UIs and tools − End-user visualization − Analytic Accelerators − +++. Enterprise Integration − To extend and enrich your information supply chain
  25. 25. IBM BigInsights – Open Source and IBM Value Adds. ANSI SQL: BigSQL, optimized SQL support. Search: BigIndex and Data Explorer. Predictive Modeling: BigR, “scalable data mining” on R. Real-time Analytics: InfoSphere Streams. Application Tooling: Toolkits and accelerators. Data Exploration: BigSheets “schema-on-read” tooling. Text Analytics: Text processing with AQL. Data Governance and Security: Data Click, LDAP and Secured Cluster. Enterprise Performance: Adaptive MapReduce, Big SQL. Storage Integration: GPFS POSIX Distributed Filesystem. Components: Oozie, Jaql, ZooKeeper, Hive, HDFS, MapReduce, HBase, Flume, Pig, Lucene, HCatalog, Sqoop. 100% based on Apache Open Source Hadoop Components
  26. 26. Manage your cluster from the integrated Web Console Start or stop services Monitor overall system health Inspect status of specific services Add / remove nodes Manage your Apps and workflows from the console Drill down into Map/Reduce, Tasks, Attempts Access status, logs, counters of individual flows / jobs © 26 2014 IBM Corporation
  27. 27. Manage your HDFS Files Navigate the distributed file system to see what’s stored Create/remove/rename directories Modify permissions Upload / download files, remove/rename files, Edit files Execute Hadoop file system shell commands © 27 2014 IBM Corporation
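The file operations listed on this slide map onto Hadoop's file system shell. The sketch below only builds and prints the `hdfs dfs` command lines rather than running them, so it works without a cluster; the paths and file names are hypothetical.

```python
import shlex


def hdfs_dfs(*args: str) -> list:
    """Build the argument vector for one `hdfs dfs` shell command."""
    return ["hdfs", "dfs", *args]


# Hypothetical paths, chosen only for illustration.
commands = [
    hdfs_dfs("-mkdir", "-p", "/user/demo/input"),            # create a directory
    hdfs_dfs("-put", "tweets.json", "/user/demo/input/"),    # upload a file
    hdfs_dfs("-chmod", "750", "/user/demo/input"),           # modify permissions
    hdfs_dfs("-mv", "/user/demo/input/tweets.json",
             "/user/demo/input/tweets-2014.json"),           # rename a file
    hdfs_dfs("-ls", "/user/demo/input"),                     # see what is stored
]

for argv in commands:
    print(shlex.join(argv))
    # With a Hadoop client installed, each command could be executed with
    # subprocess.run(argv, check=True) instead of being printed.
```

Keeping each command as a list of arguments avoids shell quoting problems if a path contains spaces.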
  28. 28. Monitoring cluster, components and applications. Cluster: system load average, CPU/Disk/Memory/Network utilization, nodes live status. HDFS: block and file info, NameNode JVM and GC info, throughput bytes written/read. MapReduce: job status, Mapper, Reducer, JobTracker. HBase: region split info, # of queries/stored files/regions, etc. Hive: metadata store (call frequency and duration). Oozie: statistics. ZooKeeper: queries, latency, watcher count, followers, etc. Flume: source and sink, # of retries and bytes written, etc. EXTENSIBLE! Build your own monitoring dashboards with the KPIs that matter to you.
  29. 29. Text Analytics: Getting measurable insights. Most of the world’s data is in unstructured or semi-structured text. Social media is full of discussions about products and services. Company-internal information is locked in blobs and description fields, and is sometimes even discarded. How do you get a metrics-based understanding of facts from unstructured text? Healthcare Analytics: E-Medical records, hospital reports. Public Sector: Case files, police records, emergency calls… Automotive Quality Insight: Tech notes, call logs, online media. Insurance Fraud: Insurance claims. Social Media for Marketing: twitter, facebook, blogs, forums
  33. 33. Big R: “End-to-end integration of R into IBM BigInsights”. 1. Explore, visualize, transform, and model big data using familiar R syntax and paradigm. 2. Scale out R • Partitioning of large data (“divide”) • Parallel cluster execution of pushed down R code (“conquer”) • All of this from within the R environment (Jaql, Map/Reduce are hidden from you) • Almost any R package can run in this environment. Diagram: R Clients (R Packages), Embedded R Execution (R Packages), Data Sources. Pull data (summaries) to R client. Or, push R functions right on the data
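The "divide" and "conquer" steps on this slide can be sketched outside R as well. The Python example below is an analogy, not the Big R API: it partitions a dataset, runs a small function on each partition in parallel, and combines only the summaries. Worker threads stand in for cluster nodes, and all function names are chosen here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data: list, num_parts: int) -> list:
    """Split `data` into `num_parts` contiguous chunks of near-equal size ("divide")."""
    size, extra = divmod(len(data), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_stats(chunk: list) -> tuple:
    """The pushed-down function: runs next to one chunk, returns a small summary."""
    return len(chunk), sum(chunk)


def distributed_mean(data: list, num_parts: int = 4) -> float:
    """Compute the mean by combining per-chunk summaries ("conquer")."""
    chunks = [c for c in partition(data, num_parts) if c]
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        summaries = list(pool.map(partial_stats, chunks))
    total_count = sum(n for n, _ in summaries)
    total_sum = sum(s for _, s in summaries)
    return total_sum / total_count


values = list(range(1, 101))
mean = distributed_mean(values)
print(mean)
```

Only the `(count, sum)` pairs travel back to the caller, which mirrors the slide's "pull data (summaries) to R client" instead of moving the raw data.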
  34. 34. BigSheets - Spreadsheet-style Analytic Tool. How it works: Model “big data” collected from various sources as collections. Filter and enrich content with built-in functions. Combine data in different collections. Visualize results through spreadsheets and charts. Export data into common formats (if desired). No programming knowledge needed!
  35. 35. Overview of Application Development Lifecycle. How it works: Sample your data. Develop your application using BigInsights tools. Test your application. Package and publish your application. Deploy your application on the cluster. Task wizards make it easier to develop applications. Editors for: Java, Java MapReduce, Hive, Jaql, Pig, Big SQL, BigSheets Reader, BigSheets Macro, AQL module, Jaql module, etc. Package and publish your application using the BigInsights Eclipse Task Launcher.
  36. 36. Running Applications in Big Data. How it works: Built-in apps make it easy to run Big Data application tasks: Import and export data from a database or files. Import and export web and social data. Perform Text Analytics on specified content. Query HBase content. Query content stored in BigInsights using Big SQL. Execute Pig or Jaql applications. EXTENSIBLE! Build your own applications and make them easy to execute from an appealing application launcher.
  37. 37. Big SQL: IBM’s SQL engine for Hadoop. Comprehensive, standard SQL – SELECT: joins, unions, aggregates, subqueries . . . – GRANT/REVOKE, INSERT … INTO – PL/SQL – Stored procs, user-defined functions – IBM data server JDBC and ODBC drivers. Optimization and performance – IBM MPP engine (C++) – Java MapReduce layer replaced with high performance IBM MPP engine (C++) – Continuously running daemons (no start up latency) – Message passing allows data to flow between nodes without persisting intermediate results – In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM) – Cost-based query optimization with 140+ rewrite rules. Various storage formats supported – Data persisted in DFS, Hive – No IBM proprietary format required. Integration with RDBMSs via LOAD, query federation. Diagram: SQL-based Application, IBM data server client, Big SQL Engine (SQL MPP Run-time), Data Sources (CSV, Seq, Parquet, RC, ORC, Avro, JSON, Custom) on BigInsights.
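To show the kind of standard SQL the slide lists (a join, an aggregate, and a subquery in one SELECT), here is a small runnable sketch. It uses Python's built-in `sqlite3` purely as a stand-in engine, not Big SQL or its JDBC/ODBC drivers, and the tables and data are invented for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'west'), (2, 'east'), (3, 'west');
    INSERT INTO orders VALUES (1, 25.0), (1, 30.0), (2, 100.0), (3, 5.0);
""")

# Join + aggregate + subquery: revenue per region, counting only orders
# worth at least half the average order amount.
query = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.amount >= (SELECT AVG(amount) FROM orders) / 2
    GROUP BY c.region
    ORDER BY c.region
"""
rows = conn.execute(query).fetchall()
for region, n_orders, revenue in rows:
    print(region, n_orders, revenue)
conn.close()
```

The value of a standard SQL layer is that a query of this shape does not need to be rewritten as MapReduce code to run against data stored in Hadoop.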
  38. 38. Big Data Accelerators Make it Easier than Ever to Build Big Data Applications. Telecommunications Event Data: CDR streaming analytics, Deep Customer Event Analytics. Ships with InfoSphere Streams. Social Data Analytics: Sentiment Analytics, Intent to purchase. Ships with InfoSphere BigInsights and Streams. Machine Data Analytics: Operational data including logs for operations efficiency. Ships with InfoSphere BigInsights.
  39. 39. Social Data Analytics: Using social media as a rich source of information. Sample posts: “Maybe our politicians should take a playbook out of the rivalry between duke/unc and take it to the courts”; “I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Raleigh) w/ 2 others”; “@silliesylvia good!!! U shouldnt! Think about the important stuff, like ur 43rd birthday ;) btw happy birthday Sylvia ;)”; “@silliesylvia I <3 your leather leggings!! Its so katniss!!”; “@bamagirl can’t wait to watch sherlock with you! Oh, robert downey jr, I still love you but bbc is so amazing”; “dear redbox please have kings speech for my new tv colin firth movie marathon”; “@silliesylvia $10 dollars says matthew mary get married next season :) #downtownabbey”; “OMG OMG. just dropped my new ipad3 crappola!!!”. Signals extracted from the posts: Behavior, Location, Interest, Intent to consume, Age, Consumption, Prediction. 360 degree profile. Personal Attributes • Sylvia Campbell, Female, In a Relationship • 32 years old, birthday on 7/17 • Lives near Raleigh, NC • College graduate; Income of 80-120k. Buzz/Sentiment • Retweets BF’s comments • Interest in BBC shows: Downton Abbey, Sherlock, Fringe, (PP?) • Sherlock Holmes, Robert Downey, Jr. • Hunger Games, Katniss/J. Lawrence. Interests/Behavior • Watch movies, tv shows • Romance plots, “hero types”, strong women • Uses iPad 3, Redbox, Hulu • Shopping, interest in sales/deals • Duke/UNC basketball
  40. 40. Machine Data Analysis is a Business Imperative. Cost of system downtime − 49 percent of Fortune 500 companies experience more than 80 hours of system downtime annually [1] • Cost of downtime varies from $90,000/hour in the media sector to $6.48 million/hour for large online brokerages • 80 hours * $6.48M = approx $500M per year − System downtime costs North American businesses $26.5 billion a year in lost revenue [2]. When systems go down − Sales and other processes stop − Work in progress may be destroyed − Failure to meet SLAs and contractual obligations can result in damages, fees, adverse publicity and damage to reputation − Customers are lost to competitors, some permanently − Productivity suffers and remediation costs additional $$$
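The downtime arithmetic on this slide is easy to reproduce. The sketch below uses only the figures quoted above (80 hours per year and the two hourly rates); the variable names are chosen here for illustration.

```python
HOURS_DOWN_PER_YEAR = 80  # "more than 80 hours" per the slide; 80 used as a floor

# Hourly cost of downtime in USD, as quoted on the slide.
COST_PER_HOUR_USD = {
    "media": 90_000,
    "large online brokerage": 6_480_000,
}

annual_cost = {
    sector: HOURS_DOWN_PER_YEAR * rate
    for sector, rate in COST_PER_HOUR_USD.items()
}

for sector, cost in annual_cost.items():
    print(f"{sector}: ${cost / 1e6:,.1f}M per year")
```

The brokerage figure works out to $518.4M, which the slide rounds to roughly $500M; even at the low end, the media-sector figure is $7.2M per year.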
  42. 42. Evolution of Cloud Technologies. Virtualization: “I want to get more out of my existing hardware”. Cloud Enabled: “I want to move my existing middleware workloads to the cloud”. Cloud Native: “I want to rapidly build new, born on the cloud, engaging applications in a continuous delivery model”. Dynamic Hybrid: “I want to strategically use public and private cloud together”. Business Services (SaaS): “I want to use an app without having to own it”
  43. 43. PaaS sits at the center of the cloud delivery model. IT Admin: Infrastructure as a Service. Developer: Platform as a Service. Business Person: Software as a Service. Each model covers the same stack (Applications, Data, Runtime, Middleware, O/S, Virtualization, Servers, Storage, Networking), divided between layers the client manages and layers the vendor manages in the cloud. Toward the client-managed end: customization; higher costs; slower time to value. Toward the vendor-managed end: standardization; lower costs; faster time to value
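The responsibility split behind this slide can be written down as a small lookup. The boundaries below follow the way the IaaS/PaaS/SaaS split is commonly drawn; the flattened slide text does not preserve exactly where each line fell, so treat the cut points as an assumption rather than a transcription.

```python
# The stack from the slide, top to bottom.
STACK = [
    "Applications", "Data", "Runtime", "Middleware", "O/S",
    "Virtualization", "Servers", "Storage", "Networking",
]

# Number of layers, counted from the top of STACK, that the client manages.
# Assumed cut points (the commonly drawn split), not taken from the slide.
CLIENT_LAYERS = {"IaaS": 5, "PaaS": 2, "SaaS": 0}


def responsibilities(model: str) -> dict:
    """Return which layers the client and the vendor manage for a model."""
    cut = CLIENT_LAYERS[model]
    return {"client": STACK[:cut], "vendor": STACK[cut:]}


for model in ("IaaS", "PaaS", "SaaS"):
    split = responsibilities(model)
    print(f"{model}: client manages {split['client'] or 'nothing'}")
```

Under this split, a PaaS developer is left with only applications and data, which is the "focus on app development" argument the later Cloud Foundry slides make.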
  44. 44. • Move quickly, see results fast. • Learn by tinkering and playing. • Needs to learn new skills through playing and experimenting safely. • Needs freedom to experiment without worrying about pricing right away. Developers, Developers, Developers! © 41 2014 IBM Corporation
  45. 45. What is Bluemix? Bluemix is an open-standard, cloud-based platform for building, managing, and running applications of all types (web, mobile, big data, new smart devices, and so on). Go Live in Seconds: The developer can choose any language runtime or bring their own. Zero to production in one command. DevOps: Development, monitoring, deployment, and logging tools allow the developer to run the entire application. APIs and Services: A catalog of IBM, third party, and open source API services allows the developer to stitch an application together in minutes. On-Prem Integration: Build hybrid environments. Connect to on-premise assets plus other public and private clouds. Flexible Pricing: Sign up in minutes. Pay as you go and subscription models offer choice and flexibility. Layered Security: IBM secures the platform and infrastructure and provides you with the tools to secure your apps.
  46. 46. Create apps quickly with prebuilt services. A full range of capabilities to suit any great idea. Choice • Runtimes, services, and tooling up to you. Industry Leading IBM Capabilities • Services leveraging the depth of IBM software • Full range of capabilities. Completeness • Open source platform and services • Third party to enable key use cases. Service categories: Watson Services, Security Services, Web and application services, Cloud Integration Services, Mobile Services, Database services, Big Data services, Internet of Things Services, DevOps Services.
  47. 47. Embracing Cloud Foundry as an Open Source PaaS. Continuing our history of embracing and extending Open Source.
  48. 48. Cloud Foundry is more than code Meets Developer’s Needs Focus on app development, not provisioning VMs, databases, messaging servers, etc. Agile development model Deploy and scale in seconds Open Cloud Platform There is an increasing appetite for cloud-based mobile, social and analytics applications from line-of-business executives - drives the need for a more open cloud development platform Compelling Community Cloud Foundry has a compelling community and emerging ecosystem as well as a mature set of capabilities and robustness © 45 2014 IBM Corporation
  49. 49. IBM extends CF by adding developer tools, runtimes, services Capabilities include Java, mobile backend development, application monitoring, as well as capabilities from ecosystem partners and open source — all through an as-a-service model in the cloud. © 46 2014 IBM Corporation
  50. 50. An Entire Continuum Working Together. Infrastructure Services: Virtual Appliances (Application Server or HTTP Server, Operating system, Metadata). Defined Pattern Services. Systems of Record. Business Services. Composable Services. Analytics.
  51. 51. IBM Analytics for Hadoop Service Powered by − BigInsights 3.0 Bluemix Get started with Hadoop in Minutes − Tutorial: Dedicated Single Node Env • BIAdmin Authority • Access to the Web console • Secure HTTPS channel powered by SSL certificates • Bluemix Single Sign On (SSO) © 48 2014 IBM Corporation
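On a Cloud Foundry based platform such as Bluemix, an application normally learns about its bound services through the `VCAP_SERVICES` environment variable, which holds JSON keyed by service label. The sketch below shows one way to read credentials from it. The label `hadoop-service` and the credential fields `console_url` and `userid` are hypothetical placeholders; the real names would come from the service's own documentation.

```python
import json
import os
from typing import Optional

# Hypothetical example of what VCAP_SERVICES might contain. The service
# label and credential field names are illustrative assumptions only.
SAMPLE_VCAP = json.dumps({
    "hadoop-service": [{
        "name": "my-hadoop",
        "credentials": {
            "console_url": "https://example.invalid/console",
            "userid": "biadmin",
        },
    }]
})


def find_credentials(label: str, vcap_json: Optional[str] = None) -> dict:
    """Return the credentials of the first bound service with this label.

    Reads the VCAP_SERVICES environment variable unless JSON text is
    passed in directly (useful for local testing).
    """
    raw = vcap_json if vcap_json is not None else os.environ.get("VCAP_SERVICES", "{}")
    instances = json.loads(raw).get(label, [])
    if not instances:
        raise LookupError(f"no bound service with label {label!r}")
    return instances[0]["credentials"]


creds = find_credentials("hadoop-service", SAMPLE_VCAP)
print(creds["console_url"], creds["userid"])
```

Reading connection details from the environment keeps them out of the source code, so the same application can run unchanged against a different service instance.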
  52. 52. Register today at 1. Rapidly bring products and services to market at lower cost: With on-demand services and infrastructure, developers can go from 0 to running code in a matter of minutes. 2. Continuously deliver new functionality to their applications: When coupled with DevOps, teams both large and small can automate the development and delivery of many applications. 3. Extend existing investments in IT infrastructure: By connecting securely to on-prem infrastructure, organizations can extend their existing investments.
  53. 53. Want to learn more? Download Quick Start Edition Test drive the technologies – Follow online tutorials – Enroll in online classes – Watch video demos, read articles, etc. Links all available from HadoopDev – © 50 2014 IBM Corporation
  54. 54. BigInsights Quick Start Edition Download: © 51 2014 IBM Corporation
  55. 55. Big Data Developers FREE All types of practitioners All skill levels Hands-on Labs Future Meetups: − Hadoop − Text Analytics − Real-time Analytics − SQL for Hadoop − HBase − Social Media Analytics − Machine Data Analytics − Security and Privacy © 52 2014 IBM Corporation