Getting started with Hadoop on the Cloud with Bluemix

October 11, 2014
Getting started with
Hadoop on the Cloud
Nicolas Morales – Solutions Engineer – nicolasm@us.ibm.com
@NicolasJMorales
© 2014 IBM Corporation

Welcome
Goal: Get you started with Hadoop on the Cloud
Hadoop
− What technical problem is it helping solve? BIG DATA
− What is Hadoop?
− BigInsights (IBM’s Hadoop distro)
Bluemix (IBM’s PaaS cloud solution)
− What technical problem is it helping solve?
− Analytics for Hadoop in the Cloud
Demo
Get hands-on
− Bluemix: bluemix.net
− Hadoop Dev: ibm.biz/hadoopdev

It starts with a line of code.


What is Big Data?
A way to describe data problems that are
unsolvable using traditional tools
More Analytics on More Data for More People

What Data?
Transactional & Application Data
Machine Data
Social Data
Enterprise Content
© 2013 IBM Corporation

In 2005 there were 1.3 billion RFID tags in circulation around the world...
...by the end of 2011, this was about 30 billion and growing even faster.

An increasingly sensor-enabled and instrumented
business environment generates HUGE volumes of
data with MACHINE SPEED characteristics…
1 BILLION lines of code
EACH engine generating 10 TB every 30 minutes!

Welcome to the Instrumented Interconnected World!
12+ TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
2+ billion people on the Web by end 2011
30 billion RFID tags today (1.3B in 2005)
76 million smart meters in 2009… 200M by 2014

6,000,000 users on Twitter pushing out 300,000 tweets per day
500,000,000 users on Twitter pushing out 400,000,000 tweets per day
83x more users, 1333x more tweets per day
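The two multipliers on this slide follow from the quoted user and tweet counts. A quick check in Python, with variable names chosen here purely for illustration:

```python
# Figures quoted on the slide.
users_before, users_after = 6_000_000, 500_000_000
tweets_before, tweets_after = 300_000, 400_000_000

user_growth = users_after / users_before      # about 83.3
tweet_growth = tweets_after / tweets_before   # about 1333.3

print(f"users grew {user_growth:.0f}x, tweets per day grew {tweet_growth:.0f}x")
```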

We’ve Moved into a New Era of Computing
Volume: 12+ terabytes of Tweets created daily.
Velocity: 5+ million trade events per second.
Variety: 100’s of different types of data.
Veracity: Only 1 in 3 decision makers trust their information.

Imagine the Possibilities of Harnessing Your Data Resources
Big data challenges exist in every organization today
Government cuts acoustic
analysis from hours to
70 Milliseconds
Retailer reduces time to
run queries by 80% to
optimize inventory
Utility avoids power
failures by analyzing
10 PB of data in minutes
Stock Exchange cuts
queries from 26 hours to
2 minutes on 2 PB
Hospital analyses streaming
vitals to detect illness
24 hours earlier
Telco analyses streaming
network data to reduce
hardware costs by 90%

Every Industry can Leverage Big Data and Analytics
Insurance
• 360° View of Domain or Subject
• Catastrophe Modeling
• Fraud & Abuse
• Producer Performance Analytics
• Analytics Sandbox
Banking
• Optimizing Offers and Cross-sell
• Customer Service and Call Center Efficiency
• Fraud Detection & Investigation
• Credit & Counterparty Risk
Telco
• Pro-active Call Center
• Network Analytics
• Location Based Services
Energy & Utilities
• Smart Meter Analytics
• Distribution Load Forecasting/Scheduling
• Condition Based Maintenance
• Create & Target Customer Offerings
Media & Entertainment
• Business process transformation
• Audience & Marketing Optimization
• Multi-Channel Enablement
• Digital commerce optimization
Retail
• Actionable Customer Insight
• Merchandise Optimization
• Dynamic Pricing
Travel & Transport
• Customer Analytics & Loyalty Marketing
• Predictive Maintenance Analytics
• Capacity & Pricing Optimization
Consumer Products
• Shelf Availability
• Promotional Spend Optimization
• Merchandising Compliance
• Promotion Exceptions & Alerts
Government
• Civilian Services
• Defense & Intelligence
• Tax & Treasury Services
Healthcare
• Measure & Act on Population Health Outcomes
• Engage Consumers in their Healthcare
Automotive
• Advanced Condition Monitoring
• Data Warehouse Optimization
• Actionable Customer Intelligence
Life Sciences
• Increase visibility into drug safety and effectiveness
Chemical & Petroleum
• Operational Surveillance, Analysis & Optimization
• Data Warehouse Consolidation, Integration & Augmentation
• Big Data Exploration for Interdisciplinary Collaboration
Aerospace & Defense
• Uniform Information Access Platform
• Data Warehouse Optimization
• Airliner Certification
Platform
Monitoring (ACM)
Electronics
• Customer/ Channel
Analytics
Monitoring

Enabling everybody to leverage Big Data
GPS
External Data
Business Users
...offer personalized price
promotions to different customer
segments in real-time
Business Development
... find and deliver new
mechanisms to monetize
network traffic and partner
with upstream content
providers
Administrators
...secure, manage, and optimize data
access and analysis operations
Executive Leaders
...get real-time reports and analysis
based on data inside as well as
outside the enterprise (web, social
media etc.)
Business Analysts
... analyze social media buzz
for the new services/offerings
to gauge initial success and
any course correction needed
Developers
... develop new Apps and
detailed algorithms in response
to user and business
requirements
Data Scientists
... analyze subscriber usage pattern
in real-time and combine that with the
profile for delivering promotional or
retention offers

Leveraging Big Data Requires Multiple Platform Capabilities
Understand and navigate federated big data sources: Federated Discovery and Navigation
Manage & store huge volume of any data: Hadoop File System, MapReduce
Structure and control data: Data Warehousing
Manage streaming data: Stream Computing
Analyze unstructured data: Text Analytics Engine
Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM

What is Hadoop?
Apache open source software framework for reliable, scalable, distributed
computing of massive amounts of data
Hides underlying system details and complexities from user
Developed in Java
Core sub projects:
− MapReduce
− Hadoop Distributed File System a.k.a. HDFS
Supported by several Hadoop-related projects
HBase
Zookeeper
Avro
Flume
etc
Meant for heterogeneous commodity hardware
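To make the MapReduce model concrete, here is a minimal single-process sketch of the classic word count in Python. It only imitates the map, shuffle, and reduce phases in memory; it is not Hadoop's Java API, and the function names are invented for this illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    sample = ["big data on the cloud", "Hadoop on the cloud"]
    print(word_count(sample))
```

On a real cluster the mappers and reducers run on many nodes in parallel, and the framework performs the shuffle over the network.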

Design Principles of Hadoop
New way of storing and processing the data:
− Let system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of operating system
• Relatively inexpensive hardware
Bring processing to Data!
Hadoop = HDFS + MapReduce infrastructure + …
Optimized to handle
− Massive amounts of data through parallelism
− A variety of data (structured, unstructured, semi-structured)
− Using inexpensive commodity hardware
Reliability provided through replication

Map-Reduce Hadoop BigInsights

Hadoop Open Source Projects
Hadoop is supplemented by an ecosystem of open source projects

What’s a Hadoop Distribution?
What’s a Linux Distribution?
− Linux Kernel
− Open Source Tools around Kernel
− Installer
− Administration UI
Open Source Distribution Formula
− Kernel
− Core Projects around Kernel
− Value Add
• Test Components
• Installer
• Administration UI
• Apps

IBM Enriches Hadoop
Scalable
− New nodes can be added
on the fly
Affordable
− Massively parallel computing on
commodity servers
Flexible
− Hadoop is schema-less, and can absorb
any type of data
Fault Tolerant
− Through MapReduce
software framework
Performance & reliability
− Adaptive MapReduce, Compression,
Indexing, Flexible Scheduler, +++
Enterprise Hardening of Hadoop
Productivity Accelerators
− Web-based UI’s and tools
− End-user visualization
− Analytic Accelerators
− +++
Enterprise Integration
− To extend & enrich your information
supply chain

IBM BigInsights – Open Source and IBM Value Adds
ANSI SQL: BigSQL, optimized SQL support
Search: BigIndex and Data Explorer
Predictive Modeling: BigR, scalable data mining on R
Real-time Analytics: InfoSphere Streams
Application Tooling: Toolkits and accelerators
Data Exploration: BigSheets “schema-on-read” tooling
Text Analytics: Text processing with AQL
Data Governance and Security: Data Click, LDAP and Secured Cluster
Enterprise Performance: Adaptive MapReduce, Big SQL
Storage Integration: GPFS POSIX Distributed Filesystem
Open source components: Oozie, Jaql, ZooKeeper, Hive, HDFS, MapReduce, HBase, Flume, Pig, Lucene, HCatalog, Sqoop
100% based on Apache Open Source Hadoop Components

Manage your cluster from the integrated Web Console
Start or stop services
Monitor overall system health
Inspect status of specific services
Add / remove nodes
Manage your Apps and workflows from the console
Drill down into Map/Reduce, Tasks, Attempts
Access status, logs, counters of individual flows /
jobs

Manage your HDFS Files
Navigate the distributed file system to see what’s stored
Create/remove/rename directories
Modify permissions
Upload / download files, remove/rename files, Edit files
Execute Hadoop file system shell commands
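The same file operations are available from the standard `hadoop fs` shell. The Python sketch below only builds and prints the command lines; it does not run them. The paths and file names are hypothetical, chosen only for illustration.

```python
import shlex


def hadoop_fs(*args):
    """Build an argument list for the `hadoop fs` shell (does not run it)."""
    return ["hadoop", "fs", *args]


# Hypothetical paths and file names, used only for illustration.
commands = [
    hadoop_fs("-mkdir", "-p", "/user/demo/input"),
    hadoop_fs("-put", "tweets.csv", "/user/demo/input/"),
    hadoop_fs("-ls", "/user/demo/input"),
    hadoop_fs("-chmod", "640", "/user/demo/input/tweets.csv"),
    hadoop_fs("-mv", "/user/demo/input/tweets.csv", "/user/demo/input/tweets-2014.csv"),
    hadoop_fs("-rm", "/user/demo/input/tweets-2014.csv"),
]

for cmd in commands:
    print(shlex.join(cmd))
```

On a machine with a configured Hadoop client, each list could be passed to `subprocess.run(cmd, check=True)` to execute it.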

Monitoring cluster, components and applications
Cluster: system load average,
CPU/Disk/Memory/Network
utilization, nodes live status
HDFS: block and file info,
NameNode JVM and GC info,
throughput bytes written/read
MapReduce: Jobs status, Mapper,
Reducer, JobTracker
HBase: region split info, #of
queries/stored files/regions etc
Hive: metadata store (call frequency
and duration)
Oozie statistics
Zookeeper: queries, latency,
watcher count, followers etc
Flume: source and sink,
#of retries and bytes written etc
EXTENSIBLE!!
Build your own Monitoring Dashboards,
with the KPIs that interest you!

Text Analytics: Getting measurable insights
Most of the world’s data is in unstructured or semi-structured text.
Social media is full of discussions about products and services
Company-internal information is locked in blobs, description fields, and
sometimes even discarded
How do you get a metrics-based understanding of facts from unstructured text?
Healthcare Analytics: E-Medical records, hospital reports
Public Sector: Case files, police records, emergency calls…
Automotive Quality Insight: Tech notes, call logs, online media
Insurance Fraud: Insurance claims
Social Media for Marketing: Twitter, Facebook, blogs, forums

Big R
“End-to-end integration of R into IBM BigInsights”
1. Explore, visualize, transform, and model big data using familiar R syntax and paradigm
2. Scale out R
• Partitioning of large data (“divide”)
• Parallel cluster execution of pushed down R code (“conquer”)
• All of this from within the R environment (Jaql, Map/Reduce are hidden from you)
• Almost any R package can run in this environment
Pull data (summaries) to R client, or push R functions right on the data
Diagram components: R Clients, R Packages, Embedded R Execution, Data Sources

BigSheets - Spreadsheet-style Analytic Tool
How it works
Model “big data” collected from various sources as collections
Filter and enrich content with built-in functions
Combine data in different collections
Visualize results through spreadsheets, charts
Export data into common formats (if desired)
No programming knowledge needed!

Overview of Application Development Lifecycle
Editors for: Java, Java MapReduce, Hive, Jaql, Pig, Big
SQL, BigSheets Reader, BigSheets Macro, AQL
module, Jaql Module, etc …
Package and publish your
application using
the BigInsights Eclipse
Task Launcher
How it works
Sample your Data
Develop your application using
BigInsights tools
Test your application
Package and publish your
application
Deploy your application on the
cluster
Task Wizards for the ease of use
to Develop Applications

Running Applications in Big Data
How it works
Built-in apps make it easy to run Big Data application tasks:
Import and Export Data from a Database or files
Import and Export Web and Social Data
Perform Text Analytics on specified content
Query HBase Content
Query content stored in BigInsights using Big SQL.
Execute Pig or Jaql applications.
EXTENSIBLE!! Build your own applications and make them easy to
execute from an appealing Application launcher

Big SQL
IBM’s SQL engine for Hadoop
Comprehensive, standard SQL
– SELECT: joins, unions, aggregates, subqueries . . .
– GRANT/REVOKE, INSERT … INTO
– PL/SQL
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers
Optimization and performance
– Java MapReduce layer replaced with high performance IBM MPP engine (C++)
– Continuous running daemons (no start up latency)
– Message passing allows data to flow between nodes without persisting intermediate results
– In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
Various storage formats supported
– Data persisted in DFS, Hive
– No IBM proprietary format required
Integration with RDBMSs via LOAD, query federation
Diagram components: SQL-based Application, IBM data server client, Big SQL Engine (SQL MPP Run-time), Data Sources (CSV, Seq, Parquet, RC, ORC, Avro, JSON, Custom), BigInsights
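As a rough illustration of the kind of standard SQL listed above (a join, an aggregate, and a subquery), the sketch below runs a query against an in-memory SQLite database from Python. SQLite is only a stand-in so the example is runnable anywhere; it is not Big SQL, and the tables and data are made up.

```python
import sqlite3

# SQLite in memory is a stand-in engine here, not Big SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (1, 10.0), (1, 30.0), (2, 5.0), (3, 20.0);
""")

# Join + aggregate + subquery: per-region totals, ignoring the largest order.
query = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.amount < (SELECT MAX(amount) FROM orders)
    GROUP BY c.region
    ORDER BY c.region
"""
rows = conn.execute(query).fetchall()
print(rows)
```

An application would send a query of this shape to the SQL engine through its JDBC or ODBC driver instead of through `sqlite3`.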

Big Data Accelerators Make it Easier than Ever to Build Big Data
Applications
Telecommunications Event Data
CDR streaming analytics
Deep Customer Event Analytics
Ships with InfoSphere Streams
Social Data Analytics
Sentiment Analytics, Intent to purchase
BigInsights, Streams
Machine Data Analytics
Operational data including logs for operations efficiency
BigInsights

Social Data Analytics
Using social media as a rich source of information
Maybe our politicians should take
a playbook out of the rivalry
between duke/unc and take it
to the courts
http://ity.com/wfUsir
Behavior
I'm at Mickey's Irish Pub Downtown
(206 3rd St, Court Ave, Raleigh) w/
2 others http://4sq.com/gbsaYR
@silliesylvia good!!! U
shouldnt! Think about the
important stuff, like ur 43rd
birthday ;)
btw happy birthday Sylvia ;)
Location
Interest
@silliesylvia I <3 your leather
leggings!! Its so katniss!!
Interest
@bamagirl can’t wait to
watch sherlock with you!
Oh, robert downey jr, I still
love you but bbc is so
amazing
Intent to consume
Age
360 degree profile
Personal Attributes
• Sylvia Campbell, Female, In a
Relationship
• 32 years old, birthday on 7/17
• Lives near Raleigh, NC
• College graduate; Income of 80-120k
Buzz/Sentiment
• Retweets BF’s comments
• Interest in BBC shows: Downton Abbey,
Sherlock, Fringe, (PP?)
• Sherlock Holmes, Robert Downey, Jr.
• Hunger Games, Katniss/J. Lawrence
Interests/Behavior
• Watch movies, tv shows
• Romance plots, “hero types”, strong
women
• Uses iPad 3, Redbox, Hulu
• Shopping , interest in sales/deals
• Duke/ UNC basketball
Consumption
dear redbox please have
kings speech for my new tv
colin firth movie marathon
Intent to consume
@silliesylvia $10 dollars says
matthew mary get married
next season :)
#downtownabbey
OMG OMG. just
dropped my new ipad3
crappola!!!
Consumption
Prediction

Machine Data Analysis is a Business Imperative
Cost of system down-time
− 49 percent of Fortune 500 companies experience more than 80 hours of system down time
annually1
• Cost of down-time varies from $90,000/hour in the media sector to $6.48 million / hour for large
online brokerages
• 80 hours * $6.48M = approx $500M per year
− System downtime costs North American businesses $26.5 billion a year in lost revenue2
When systems go down
− Sales and other processes stop
− Work in progress may be destroyed
− Failure to meet SLA’s and contractual obligations can result in damages, fees, adverse publicity
and damage to reputation
− Customers are lost to competitors, some permanently
− Productivity suffers and remediation costs additional $$$’s
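The annual cost range implied by the downtime figures above can be worked out directly. A small Python check using only the numbers quoted on the slide; the variable names are illustrative:

```python
hours_down_per_year = 80          # downtime quoted on the slide
cost_per_hour_low = 90_000        # media sector, USD per hour
cost_per_hour_high = 6_480_000    # large online brokerage, USD per hour

low = hours_down_per_year * cost_per_hour_low
high = hours_down_per_year * cost_per_hour_high

print(f"${low / 1e6:.1f}M to ${high / 1e6:.1f}M per year")
```

The upper end comes to $518.4M, which the slide rounds to approximately $500M per year.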

Evolution of Cloud Technologies
Virtualization: “I want to get more out of my existing hardware”
Cloud Enabled: “I want to move my existing middleware workloads to the cloud”
Cloud Native: “I want to rapidly build new, born on the cloud, engaging applications in a continuous delivery model”
Dynamic Hybrid: “I want to strategically use public and private cloud together”
Business Services (SaaS): “I want to use an app without having to own it”

PaaS sits at the center of the cloud delivery model
IT Admin: Infrastructure as a Service
Developer: Platform as a Service
Business Person: Software as a Service
Client Manages
Applications Applications Applications
Data Data Data
Runtime Runtime Runtime
Vendor Manages in Cloud
Middleware Middleware Middleware
O/S O/S O/S
Virtualization Virtualization Virtualization
Servers Servers Servers
Storage Storage Storage
Networking Networking Networking
Client Manages
Customization; higher costs; slower time to value
Standardization; lower costs; faster time to value

• Move quickly, see results fast.
• Learn by tinkering and
playing.
• Needs to learn new skills
through playing and
experimenting safely.
• Needs freedom to experiment
without worrying about
pricing right away.
Developers, Developers, Developers!

What is Bluemix?
Bluemix is an open-standard, cloud-based platform for building, managing,
and running applications of all types (web, mobile, big data, new smart
devices, and so on).
Go Live in Seconds: The developer can choose any language runtime or bring their own. Zero to production in one command.
DevOps: Development, monitoring, deployment, and logging tools allow the developer to run the entire application.
APIs and Services: A catalog of IBM, third party, and open source API services allows the developer to stitch an application together in minutes.
On-Prem Integration: Build hybrid environments. Connect to on-premise assets plus other public and private clouds.
Flexible Pricing: Sign up in minutes. Pay as you go and subscription models offer choice and flexibility.
Layered Security: IBM secures the platform and infrastructure and provides you with the tools to secure your apps.
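On Cloud Foundry based platforms such as Bluemix, an application typically learns about its bound services from the `VCAP_SERVICES` environment variable and about its listening port from `PORT`. The Python sketch below shows one way an app might read them. The service label, instance name, and credential fields are hypothetical placeholders, not the actual names used by any Bluemix service.

```python
import json
import os


def find_credentials(label, environ=os.environ):
    """Return the credentials of the first bound service with this label, or None."""
    services = json.loads(environ.get("VCAP_SERVICES", "{}"))
    instances = services.get(label, [])
    return instances[0].get("credentials") if instances else None


# Hypothetical environment, shaped like what the platform injects at runtime.
fake_env = {
    "PORT": "8080",
    "VCAP_SERVICES": json.dumps({
        "hypothetical-hadoop-service": [
            {
                "name": "my-hadoop",
                "credentials": {"url": "https://example.invalid", "user": "demo"},
            }
        ]
    }),
}

creds = find_credentials("hypothetical-hadoop-service", fake_env)
port = int(fake_env.get("PORT", "8080"))
print(creds["user"], port)
```

Reading credentials from the environment keeps them out of the source code, so the same app can be pushed to different spaces with different bound services.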

Create apps quickly with prebuilt services
Choice
• Runtimes, services, and tooling up to you
Industry Leading IBM Capabilities
• Services leveraging the depth of IBM software
• Full range of capabilities
Completeness
• Open source platform and services
• Third party to enable key use cases
Service categories: Watson, Security, Web and application, Cloud Integration, Mobile, Database, Big Data, Internet of Things, DevOps
A full range of capabilities to suit any great idea.

Embracing Cloud Foundry as an Open Source PaaS
Continuing our history of embracing and extending Open Source

Cloud Foundry is more than code
Meets Developer’s
Needs
Focus on app
development, not
provisioning VMs,
databases, messaging
servers, etc.
Agile development
model
Deploy and scale in
seconds
Open Cloud Platform
There is an increasing
appetite for cloud-based
mobile, social and analytics
applications
from line-of-business
executives - drives the need
for a more open cloud
development platform
Compelling Community
Cloud Foundry has a
compelling community and
emerging ecosystem as well
as a mature set of
capabilities and robustness

IBM extends CF by adding developer tools, runtimes, services
Capabilities include Java, mobile backend
development, application monitoring, as
well as capabilities from ecosystem
partners and open source — all through
an as-a-service model in the cloud.

An Entire Continuum Working Together
Infrastructure Services
Virtual Appliance: Application Server, Operating system, Metadata
Virtual Appliance: Application Server, Operating system, Metadata
Virtual Appliance: HTTP Server, Operating system, Metadata
Defined Pattern Services
Systems of Record
Business Services
Composable Services
Analytics
