Open Source SW @ IBM Big Data
Boulder Java User Group
06/11/13
Ivan Portilla
ivanp@us.ibm.com
portilla@gmail.com
Ryan DeJana
rdejana@us.ibm.com
Disclaimer
• This presentation represents the view of the authors and does not represent the view of IBM.
• All opinions expressed in this presentation are strictly those of the speakers, and do NOT represent those of IBM, IBM management, or anyone else.
• IBM and IBM (logo) are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.
• Many thanks to Rafael Coss & Paul Zikopoulos for the materials used in this presentation.
Agenda
• Big Data
• OSS in IBM Big Data platform
• Demo
Big Data Size Equivalence

Name | Value | RAMAC | IPOD
1 Giga (GB) | 10^9 | 200 |
1 Tera (TB) | 10^12 | 200K | 200
1 Peta (PB) | 10^15 | 200M | 200K
1 Exa (EB) | 10^18 | 200B | 200M
1 Zetta (ZB) | 10^21 | 200T | 200B
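The RAMAC and IPOD columns can be reproduced with a little arithmetic. The sketch below assumes a 5 MB RAMAC and a 5 GB iPod; these capacities are inferred from the table (200 RAMACs per gigabyte, 200 iPods per terabyte) and are not stated on the slide. The names `RAMAC_BYTES`, `IPOD_BYTES`, and `units_needed` are illustrative.

```python
# Assumed device capacities, inferred from the table rather than stated on the slide.
RAMAC_BYTES = 5 * 10**6   # 5 MB
IPOD_BYTES = 5 * 10**9    # 5 GB

SIZES = {
    "Giga": 10**9,
    "Tera": 10**12,
    "Peta": 10**15,
    "Exa": 10**18,
    "Zetta": 10**21,
}

def units_needed(total_bytes, unit_bytes):
    """Return how many whole devices of unit_bytes capacity fit in total_bytes."""
    return total_bytes // unit_bytes

for name, size in SIZES.items():
    print(name, units_needed(size, RAMAC_BYTES), units_needed(size, IPOD_BYTES))
```

Under these assumptions a gigabyte is smaller than one iPod, which would explain the empty IPOD cell in the first row.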
  
Why Didn’t We Use All of the Big Data Before?
Big Data Includes Any of the following Characteristics:
Extracting insight in context, beyond what was previously possible.
Variety: Manage the complexity of multiple relational and non-relational data types and schemas
Velocity: Streaming data and large volume data movement
Volume: Scale from terabytes to zettabytes
Veracity
Big Data Has New Opportunities But Needs New Analytics
[Figure: data scale (Kilo, Mega, Giga, Tera, Peta, Exa) plotted against decision frequency (yr, mo, wk, day, hr, min, sec, ms, µs; occasional, frequent, real-time). Regions: Traditional Data Warehouse and Business Intelligence; Data at Rest; Data in Motion. Callouts: up to 10,000 times larger; up to 10,000 times faster.]
Telco Promotions
100,000 records/sec, 6B/day
10 ms/decision
270TB for Deep Analytics
DeepQA
100s GB for Deep Analytics
3 sec/decision
Smart Traffic
250K GPS probes/sec
630K segments/sec
2 ms/decision, 4K vehicles
Homeland Security
600,000 records/sec, 50B/day
1-2 ms/decision
320TB for Deep Analytics
Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn, NBO
Most requested use cases of Big Data

Utilities
§ Weather impact analysis on power generation
§ Transmission monitoring
§ Smart grid management
Retail
§ 360° View of the Customer
§ Click-stream analysis
§ Real-time promotions
Law Enforcement
§ Real-time multimodal surveillance
§ Situational awareness
§ Cyber security detection
Transportation
§ Weather and traffic impact on logistics and fuel consumption
§ Traffic congestion
Financial Services
§ Fraud detection
§ Risk management
§ 360° View of the Customer
IT
§ System log analysis
§ Cybersecurity
Telecommunications
§ CDR processing
§ Churn prediction
§ Geomapping / marketing
§ Network monitoring
Health & Life Sciences
§ Epidemic early warning
§ ICU monitoring
§ Remote healthcare monitoring
Follow this link for details on Industry Big Data use cases
§ Public wind data is available on 284 km x 284 km grids (2.5° LAT/LONG)
§ More data means more accurate and richer models (adding hundreds of variables)
- Vestas wind library at 2.5 PB: to grow to over 6 PB in the near-term
- Granularity 27 km x 27 km grids: driving to 9x9, 3x3 to 10m x 10m simulations
§ Reduced turbine placement identification from weeks to hours
§ Perspective: The Vestas Wind library, as HD TV, would take 70 years to watch
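As a rough sanity check on the 70-year figure, the video bitrate it implies can be computed directly. This sketch assumes decimal petabytes (10^15 bytes) and 365.25-day years; the slide does not state the bitrate it used, and `implied_bitrate_mbps` is an illustrative name.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def implied_bitrate_mbps(total_bytes, years):
    """Bitrate in megabits per second at which total_bytes plays for `years`."""
    seconds = years * SECONDS_PER_YEAR
    return total_bytes * 8 / seconds / 10**6

# 2.5 PB library watched continuously for 70 years.
print(round(implied_bitrate_mbps(2.5 * 10**15, 70), 2))
```

The result is roughly 9 Mbit/s, which is a plausible rate for compressed HD video, so the comparison appears to be in the right range.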
Big Data Analytics in Smarter Hospitals
IBM Data Baby
youtube.com
Big Data enabled doctors from University of Ontario to apply neonatal infant monitoring to predict infection in ICU 24 hours in advance
http://www.youtube.com/watch?v=0lt0hTNtjrY&feature=results_main&playnext=1&list=PL783389D2F81FFAB5
IBM Watson is a breakthrough in analytic innovation, but it is only successful
because of the quality of the information from which it is working.
Big Data and Watson
InfoSphere BigInsights
POS Data
CRM Data
Social Media
Distilled Insight
-  Spending habits
-  Social relationships
-  Buying trends
Advanced
search and
analysis
Watson can consume insights from Big Data for advanced analysis
Big Data technology is used to build Watson’s knowledge base
Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory.
Approx. 200M pages of text
(To compete on Jeopardy!)
Watson’s
Memory
IBM is committed to Open Source
►  Decade of lineage and contributions to
the open source community
– Apache Hadoop and Jaql, Apache
Derby, Apache Geronimo, Apache
Jakarta, +++
– Eclipse: founded by IBM
– Significant Lucene contributions via IBM
Lucene Extension Library (ILEL)
– DRDA, XQuery, SQL, XML4J, XERCES,
HTTP, Java, Linux, +++
►  IBM products built on open source
– WebSphere: Apache
– Rational: Eclipse and Apache
– InfoSphere: Eclipse and Apache, +++
►  IBM’s BigInsights (Hadoop) is 100%
open source compatible with
no forks
Introducing MapReduce
►  In 2003 and 2004 Google released two papers that provide insight
into their success
– The Google File System
– MapReduce: Simplified Data Processing on Large Clusters
►  Introduced an approach to large scale data processing known as
MapReduce
Global TLE Framework
MapReduce
►  A programming model
– Inspired by functional programming
– Allows expressing distributed computations on large amounts of data
►  Execution framework
– Designed for large-scale data processing
– Designed to run on clusters of commodity hardware
MapReduce, the programming model
►  Process key-value records
►  Map function:
(K_in, V_in) → list(K_inter, V_inter)
►  Barrier between map and reduce phases
– Shuffle and sort phase moves and groups like keys
►  Reduce function:
(K_inter, list(V_inter)) → list(K_out, V_out)
Map phase, word-count example
(line1, “Hello there.”)
(line2, “Why, hello.”)
(“hello”, 1)
(“there”, 1)
(“why”, 1)
(“hello”, 1)
Sort phase, word-count example
(“hello”, 1)
(“hello”, 1)
(“there”, 1)
(“why”, 1)
Reduce phase, word-count example
(“hello”, 1)
(“hello”, 1)
(“there”, 1)
(“why”, 1)
(“hello”, 2)
(“there”, 1)
(“why”, 1)
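The three phases of the word-count example can be traced in a few lines of Python. This is a single-process sketch, not how a cluster runs the job: `map_phase`, `sort_phase`, and `reduce_phase` are illustrative names, and the lowercasing and punctuation stripping are assumptions inferred from the outputs shown on the slides.

```python
import string
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: turn each (line_id, text) record into a list of (word, 1) pairs."""
    pairs = []
    for _line_id, text in records:
        for token in text.split():
            # Assumed normalization: drop surrounding punctuation, lowercase.
            word = token.strip(string.punctuation).lower()
            if word:
                pairs.append((word, 1))
    return pairs

def sort_phase(pairs):
    """Shuffle and sort: order the pairs so that like keys are adjacent."""
    return sorted(pairs, key=itemgetter(0))

def reduce_phase(sorted_pairs):
    """Reduce: sum the counts for each run of identical keys."""
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(sorted_pairs, key=itemgetter(0))
    ]

records = [("line1", "Hello there."), ("line2", "Why, hello.")]
mapped = map_phase(records)
ordered = sort_phase(mapped)
reduced = reduce_phase(ordered)
print(mapped)
print(ordered)
print(reduced)
```

Each printed list corresponds to one of the map, sort, and reduce slides above.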
MapReduce, end to end
Pseudocode for word-count
def mapper(line):
    # Emit a (word, 1) pair for each whitespace-separated word.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Sum the counts collected for one word.
    yield (key, sum(values))
Same code can be applied to thousands of lines,
even the whole web!
Google processes over 20 PB a day, much of it in
MapReduce programs.
But what about the data!
Compute Nodes
NAS
SAN
Distributed file system enables processing to
be moved to the data!
(key1, value1)
(key2, value2)
…
(key1, value1)
(key2, value2)
…
Processing is done local to the data
Key-value pairs are processed independently and in parallel!
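The independence of key-value pairs is what makes the parallelism possible. The sketch below uses a thread pool on one machine as a stand-in for cluster nodes, with each in-memory list playing the role of a block of data held by one node; `count_words` and `parallel_word_count` are illustrative names, and nothing here reflects how Hadoop actually schedules tasks.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(partition):
    """Process one partition on its own; no other partition is consulted."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts

def parallel_word_count(partitions, workers=2):
    """Count each partition independently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two partitions, as if stored on two different nodes.
partitions = [["hello there"], ["why hello"]]
result = parallel_word_count(partitions)
print(result)
```

Because no partition depends on another, adding workers changes only how fast the partial counts are produced, not the merged result.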
Hadoop – A M/R Framework
►  Apache open source software framework for reliable, scalable,
distributed computing of massive amounts of data
§ Hides underlying system details and complexities from user
§ Developed in Java
►  Core sub projects:
− MapReduce
− Hadoop Distributed File System a.k.a. HDFS
− Hadoop Common
►  Supported by several Hadoop-related projects
§ HBase
§ Zookeeper
§ Avro
§ Etc.
►  Meant for heterogeneous commodity hardware
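Although Hadoop itself is developed in Java, its Hadoop Streaming utility lets the map and reduce steps be any programs that read lines on standard input and write tab-separated key/value lines on standard output. The sketch below shows that line-oriented contract with plain Python generators and simulates the framework's sort locally; `stream_mapper` and `stream_reducer` are illustrative names, and in a real job each would be wrapped in its own script reading standard input.

```python
from itertools import groupby

def stream_mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def stream_reducer(sorted_lines):
    """Sum counts for runs of identical keys; input must be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the framework: map, sort by key, then reduce.
lines = ["Hello there", "Why hello"]
output = list(stream_reducer(sorted(stream_mapper(lines))))
for out in output:
    print(out)
```

The `sorted` call stands in for the shuffle and sort that the framework performs between the two steps.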
Hadoop Architecture
Who uses Hadoop?
Hadoop Open Source Projects
►  Hadoop is supplemented by an ecosystem of open source projects
Jaql
Oozie
The IBM Big Data Platform
InfoSphere BigInsights
Hadoop-based low latency
analytics for variety and volume
Data-At-Rest
Netezza High
Capacity Appliance
Queryable Archive for
Structured Data
Netezza 1000
BI+Ad Hoc Analytics on
Structured Data
Smart Analytics System
Operational Analytics on
Structured Data
Informix Timeseries
Time-structured analytics
InfoSphere Warehouse
Large volume structured data
analytics
InfoSphere Streams
Low Latency Analytics for
streaming data
Velocity, Variety & Volume
Data-In-Motion
MPP Data Warehouse
Stream Computing
Information Integration
Hadoop
InfoSphere Information
Server
High volume data integration
and transformation
Apache Hadoop:
open source framework
for the distributed processing
of large data sets across
clusters of computers using a
simple programming model
The IBM Big Data Platform
Integrate and manage the full variety, velocity and volume of data
Apply advanced analytics to information in its native form
Visualize all available data for ad-hoc analysis
Development environment for building new analytic applications
Workload optimization and scheduling
Security and Governance
BigInsights Brings Hadoop to the Enterprise
►  BigInsights = analytical platform for
persistent Big Data
–  Based on open source & IBM technologies
–  Managed like a start-up . . . . Emphasis on
deep customer engagements, product plan
flexibility
►  Distinguishing characteristics
– Built-in analytics . . . . Enhances business
knowledge
– Enterprise software integration . . . .
Complements and extends existing
capabilities
– Production-ready platform with tooling for
analysts, developers, and
administrators. . . . Speeds time-to-value;
simplifies development and maintenance
►  IBM advantage
– Combination of software, hardware, services
and advanced research
Hadoop
System
InfoSphere BigInsights
Platform for volume, variety,
velocity
►  Enhanced Hadoop
foundation
Analytics
►  Text analytics & tooling
►  Application accelerators
Usability
►  Web console
►  Spreadsheet-style tool
►  Ready-made “apps”
Enterprise Class
►  Storage, security, cluster
management
Integration
►  Connectivity to Netezza,
DB2, JDBC databases, etc
Apache
Hadoop
Basic Edition
Enterprise Edition
Licensed
Application accelerators
Pre-built applications
Text analytics
Spreadsheet-style tool
RDBMS, warehouse connectivity
Administrative tools, security
Eclipse development tools
Performance enhancements
. . . .
Free download
Integrated install
Online InfoCenter
BigData Univ.
Breadth of capabilities
Enterprise class
BigInsights Basic Edition
Connectivity and integration
JDBC
Flume
Infrastructure Jaql
Hive
Pig
HBase
MapReduce
HDFS
ZooKeeper
Lucene
Oozie
Open Source IBM
Integrated
installer
Sqoop
HCatalog
BigInsights Enterprise Edition
Connectivity and Integration Streams
Netezza
Text
processing
engine and
library
JDBC
Flume
Infrastructure Jaql
Hive
Pig
HBase
MapReduce
HDFS
ZooKeeper
Indexing Lucene
Adaptive
MapReduce
Oozie
Text compression
Enhanced
security
Flexible
scheduler
Optional
IBM and
partner
offerings
Analytics and discovery “Apps”
DB2
BigSheets
Web Crawler
Distrib file copy
DB export
Boardreader
DB import
Ad hoc query
Machine
learning
Data
processing
. . .
Administrative and
development tools
Web console
• Monitor cluster health, jobs, etc.
• Add / remove nodes
• Start / stop services
• Inspect job status
• Inspect workflow status
• Deploy applications
• Launch apps / jobs
• Work with distrib file system
• Work with spreadsheet interface
• Support REST-based API
• . . .
R
Eclipse tools
• Text analytics
• MapReduce programming
• Jaql, Hive, Pig development
• BigSheets plug-in development
• Oozie workflow generation
Integrated
installer
Open Source IBM
IBM Cognos BI
GPFS (EAP)
Accelerator for
machine data
analysis
Accelerator for
social data
analysis
Guardium
DataStage
Data Explorer
Sqoop
HCatalog
Open Source Components Across Distributions

Component | BigInsights 2.0 | HortonWorks HDP 1.2 | MapR 2.0 | Greenplum HD 1.2 | Cloudera CDH3u5 | Cloudera CDH4*
Hadoop | 1.0.3 | 1.1.2 | 0.20.2 | 1.0.3 | 0.20.2 | 2.0.0*
HBase | 0.94.0 | 0.94.2 | 0.92.1 | 0.92.1 | 0.90.6 | 0.92.1
Hive | 0.9.0 | 0.10.0 | 0.9.0 | 0.8.1 | 0.7.1 | 0.8.1
Pig | 0.10.1 | 0.10.1 | 0.10.0 | 0.9.2 | 0.8.1 | 0.9.2
Zookeeper | 3.4.3 | 3.4.5 | X | 3.3.5 | 3.3.5 | 3.4.3
Oozie | 3.2.0 | 3.2.0 | 3.1.0 | X | 2.3.2 | 3.1.3
Avro | 1.6.3 | X | X | X | X | X
Flume | 0.9.4 | 1.3.0 | 1.2.0 | X | 0.9.4 | 1.1.0
Sqoop | 1.4.1 | 1.4.2 | 1.4.1 | X | 1.3.0 | 1.4.1
HCatalog | 0.4.0 | 0.5.0 | 0.4.0 | X | X | X
BigInsights continues to offer the most proven, stable versions of Apache Hadoop components
*Cloudera CDH4 Hadoop 2.0 includes Map Reduce 2.0 which Cloudera states “not yet considered stable”
Hadoop Systems
HDFS
Map/Reduce
Hive, Pig & Jaql
Sqoop
Zookeeper
Avro (Serialization)
HBase
ETL Tools
BI Reporting
RDBMS
BigInsights Content
Function | Version | Basic Edition | Enterprise Edition
Integrated Install | | Inc | Inc
Hadoop (including common utilities, HDFS, MapReduce framework) | 1.0.3 | Inc | Inc
Jaql (programming / query language) | 0.5.2 | Inc | Inc
Pig (programming / query language) | 0.10.0 | Inc | Inc
Flume (data collection/aggregation) | 0.9.4 | Inc | Inc
Hive (data summarization/querying) | 0.9.0 | Inc | Inc
Lucene (text search)* | 3.3.0 | Inc | Inc
Zookeeper (process coordination) | 3.4.3 | Inc | Inc
Avro (data serialization) | 1.6.3 | Inc | Inc
HBase (real time read/write) | 0.94.0 | Inc | Inc
HCatalog (table and storage management service) | 0.4.0 | Inc | Inc
Sqoop (RDBMS bulk data transfer) | 1.4.1 | Inc | Inc
Oozie (workflow / job orchestration) | 3.2.0 | Inc | Inc
Online documentation | | Inc | Inc
Integration with JDBC sources through general-purpose Jaql module | | Inc | Inc
Integration with DB2 (sample functions to submit jobs, read data) | | Inc | Inc
BigInsights Content (cont’d)
Function | Basic Edition | Enterprise Edition
Integration with R (Jaql module to invoke R statistical capabilities from BigInsights) | n/a | Inc
Integration with Netezza, DB2 LUW with DPF from Jaql | n/a | Inc
LDAP authentication, Guardium support, etc. | n/a | Inc
Integrated Web Console | n/a | Inc
Business process accelerators (social data, machine data analytics) | n/a | Inc
Platform performance enhancements (Adaptive MapReduce, large scale indexing, efficient processing of compressed text files, flexible job scheduler, etc.) | n/a | Inc
Text analytics | n/a | Inc
Eclipse tools for text analytic development, Jaql, Hive, Java | n/a | Inc
Applications for data import/export, Web crawl, machine learning, etc. | n/a | Inc
Web-based application catalog | n/a | Inc
Spreadsheet-like analytical tool | n/a | Inc
IBM support | Opt | Inc
Streams, Data Explorer, Cognos BI (limited use licenses) | n/a | Inc
Unlimited storage | n/a | Inc
BigInsights: Value Beyond Open Source
Enterprise Capabilities
Administration & Security
Workload Optimization
Connectors
Open source
components
Advanced Engines
Visualization & Exploration
Development Tools
IBM-certified
Apache Hadoop or or …
Key differentiators
• Built-in analytics
• Text engine, annotators, Eclipse tooling
• Interface to project R (statistical platform)
• Enterprise software integration
• Spreadsheet-style analysis
• Integrated installation of supported open source and other components
• Web Console for admin and application access
• Platform enrichment: additional security, performance features, . . .
• World-class support
• Full open source compatibility
Business benefits
• Quicker time-to-value due to IBM technology and support
• Reduced operational risk
• Enhanced business knowledge with flexible analytical platform
• Leverages and complements existing software
Big Insights - Demo
Big Data Application Ecosystem
Eclipse
App library
MapReduce, …
Text Analytics
Query
App Development
• Code application program, and generate associated App
• Deploy Apps to Enterprise Manager
App Development
Publish
Data integration scenario:
Pre-defined work flows simplify loading data from various sources
• Work flows can be configured, deployed, executed and scheduled
Development tooling:
• Text analytics
• MapReduce
• Query languages
• . . .
Application scenarios (web log, email, social media, …):
• Samples provide starting point, speed time to value
Big Data Web Console
Web Console
• Manage BigInsights
Inspect /monitor system health
Add / drop nodes
Start / stop services
Run / monitor jobs (applications)
Explore / modify file system
Create custom dashboards
. . .
• Launch applications
Spreadsheet-like analysis tool
Pre-built applications (IBM
supplied or user developed)
• Publish applications
• Monitor cluster, applications,
data, etc.
Running Applications from the Web Console
• Import & Export Data
• Database & Files
• Web and Social
• Analyze and Query
• Predictive Analytics
• Text Analytics
• SQL/Hive, Jaql, Pig, HBase
Spreadsheet-style Analysis
•  Web-based analysis and
visualization
•  Spreadsheet-like
interface
Define and manage long
running data collection
jobs
Analyze content of the text
on the pages that have
been retrieved
Get started with BigInsights
•  In the Cloud
Via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise
Cloud, or on private clouds.
Pay only for the resources used.
•  In the Classroom
Via IBM Education
Online at www.bigdatauniversity.com
•  On Your Cluster
Download Basic Edition from ibm.com.
•  With the BigInsights Community
– Technical portal @ http://tinyurl.com/biginsights
– BigData on DW @ http://ibm.co/bigdatadev
Links to demos, papers, forum, downloads, etc.
• Stay connected with IBM Big Data
– http://ibmbigdatahub.com
BigDataUniversity.com
Learn Big Data Technologies
• Flexible on-line delivery
allows learning @your place
and @your pace
§ Free courses, free study
materials.
§ Cloud-based sandbox
for exercises – zero setup
§ 66666 registered students.
§ Robust Course
Management System and
Content Distribution
infrastructure
Big Data is ripe for innovation
Backup slides
OSS in IBM Big Data Platform
Hadoop - http://hadoop.apache.org/
HDFS - http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
Hive - http://hive.apache.org/
HBase - http://hbase.apache.org/
Flume - http://flume.apache.org/
Jaql - http://code.google.com/p/jaql/wiki/Running
Oozie - http://oozie.apache.org/
Sqoop - http://sqoop.apache.org/
Avro - http://avro.apache.org/
Lucene - http://lucene.apache.org/
Pigserver - http://pig.apache.org/
Zookeeper - http://zookeeper.apache.org/
Bigtop - http://bigtop.apache.org/
Build a Big Data Program – MapReduce example
Eclipse tools
For Jaql, Hive, Pig, Java MapReduce, BigSheets
plug-ins, text analytics, etc.
BigInsights Text Analytics Development
BigInsights and Text Analytics
• Distills structured info from
unstructured text
Sentiment analysis
Consumer behavior
Illegal or suspicious activities
…
• Parses text and detects meaning
with annotators
• Understands the context in which
the text is analyzed
• Features pre-built extractors for
names, addresses, phone numbers,
etc.
• Built-in support for English,
Spanish, French, German,
Portuguese, Dutch, Japanese,
Chinese
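To give a feel for what an extractor produces, here is a minimal sketch that pulls phone numbers out of free text and records where each one was found. It uses plain regular expressions as a stand-in and is not the BigInsights text analytics engine; the `Annotation` type, the `PHONE` pattern, and `extract_phones` are all hypothetical names, and the pattern only covers simple ten-digit numbers.

```python
import re
from typing import List, NamedTuple

class Annotation(NamedTuple):
    """A labeled span of text, with character offsets into the document."""
    label: str
    text: str
    start: int
    end: int

# Hypothetical pattern for numbers written like 555-123-4567 or 555.123.4567.
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def extract_phones(document: str) -> List[Annotation]:
    """Return one Annotation per phone-number match in the document."""
    return [
        Annotation("PhoneNumber", m.group(), m.start(), m.end())
        for m in PHONE.finditer(document)
    ]

sample = "Call the front desk at 555-123-4567 or 555.987.6543 after noon."
found = extract_phones(sample)
for ann in found:
    print(ann)
```

Keeping the offsets alongside the matched text is what lets structured records be traced back to their place in the unstructured source.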
Football World Cup 2010, one team
distinguished themselves well, losing to the
eventual champions 1-0 in the Final. Early in
the second half, Netherlands’ striker, Arjen
Robben, had a breakaway, but the keeper for
Spain, Iker Casillas made the save. Winger
Andres Iniesta scored for Spain for the win.
Unstructured text (document, email, etc)
Classification and Insight
Example Analysis : Extraction from Twitter
messages
Extract intent, interests, life events and micro segmentation attributes
I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others
http://4sq.com/gbsaYR
 @silliesylvia good!!! U shouldnt! Think about the important stuff, like ur birthday ;)
btw happy birthday Sylvia ;)
@rakonturmiami im moving to miami in 3 months. i look foward to the new lifestyle
I had an iphone, but it's dead @JoaoVianaa. (I've no idea where it's) !Want a blackberry
now !!!
Monetizable Intent
Relocation
Location
Name, Birth Day
Subtle Spam,
Advertising
Sarcasm,
Wishful Thinking
While accounting for less relevant messages
I think that @justinbieber deserves his 2 AMAZING songs in top ten!!! Buy them on itunes
http://Cell-Pones.com Looking to buy a phone? WiFi Cell Phones, Windows Mobile
@purplepleather Gotta do more research my Versace term paper 2day. Before I die, I
want a versace purple diamond tiara. Im just sayin>lol
had so much fun today! I want to buy a million dollar house with a wrap around
porch ... ... wading river on the long island sound, ha i wish!

Big Data and OSS at IBM

  • 1.
    Open  Source  SW  @  IBM  Big  Data   Boulder Java User Group 06/11/13 Ivan Portilla ivanp@us.ibm.com portilla@gmail.com Ryan DeJana rdejana@us.ibm.com - 1 -
  • 2.
    Disclaimer ü  This presentationrepresents the view of the authors and does not represent the view of IBM. ü  All opinions expressed in this presentation are strictly of the speakers, and do NOT represent those of IBM, IBM management, or anyone else. ü  IBM and IBM (logo) are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries. ü  Many Thanks to Rafael Coss & Paul Zikopoulos for the materials used in this presentation.
  • 3.
    Agenda ü  Big Data ü OSS in IBM Big Data platform ü  Demo - 3 -
  • 4.
  • 5.
  • 6.
    Big Data Size Equivalence 6 Name   Value   RAMAC   IPOD   1  Giga  (GB)   10^9   200   1  Tera  (TB)   10^12   200K   200   1  Peta  (PB)     10^15   200M   200K   1  Exa  (EB)   10^18   200B   200M   1  ZeEa  (ZB)   10^21   200T   200B  
  • 7.
    Why Didn’t WeUse All of the Big Data Before?
  • 8.
    Big Data IncludesAny of the following Characteristics: Extracting insight in context, beyond what was previously possible. 8 Manage the complexity of multiple relational and non- relational data types and schemas Variety   Streaming data and large volume data movement Velocity   Scale from terabytes to zettabytes Volume  
  • 9.
  • 10.
    Up to 10,000 Times larger Up to10,000 times faster Traditional Data Warehouse and Business Intelligence DataScale DataScale yr mo wk day hr min sec … ms µs Exa Peta Tera Giga Mega Kilo Decision Frequency Occasional Frequent Real-time Data in Motion DataatRest Big Data Has New Opportunities But Needs New Analytics - 1 0 Telco Promotions 100,000 records/sec, 6B/day 10 ms/decision 270TB for Deep Analytics DeepQA 100s GB for Deep Analytics 3 sec/decision Smart Traffic 250K GPS probes/sec 630K segments/sec 2 ms/decision, 4K vehicles Homeland Security 600,000 records/sec, 50B/day 1-2 ms/decision 320TB for Deep Analytics
  • 11.
    Applications for BigData Analytics Homeland  Security   Finance    Smarter  Healthcare   MulM-­‐channel  sales   Telecom   Manufacturing   Traffic  Control   Trading  AnalyMcs   Fraud  and  Risk   Log  Analysis   Search  Quality   Retail:  Churn,  NBO  
  • 12.
    U8li8es   §  Weather  impact  analysis  on  power   generaMon   §  Transmission  monitoring   §  Smart  grid  management   Retail   §  360°  View  of  the  Customer   §  Click-­‐stream  analysis   §  Real-­‐Mme  promoMons   Law  Enforcement   §  Real-­‐Mme  mulMmodal  surveillance   §  SituaMonal  awareness   §  Cyber  security  detecMon   Transporta8on   §  Weather  and  traffic   impact  on  logisMcs  and   fuel  consumpMon   §  Traffic  congesMon   Financial Services §  Fraud detection §  Risk management §  360° View of the Customer IT   §  System  log  analysis   §  Cybersecurity   Telecommunica8ons   §  CDR  processing   §  Churn  predicMon   §  Geomapping  /  markeMng   §  Network  monitoring   Most requested use cases of Big Data 12 Health  &  Life  Sciences   §  Epidemic  early  warning   §  ICU  monitoring   §  Remote  healthcare  monitoring   Follow this link for details on Industry Big Data use cases
  • 13.
    13   § Public  wind  data  is  available  on  284km  x  284   km  grids  (2.5o  LAT/LONG)   § More  data  means  more  accurate  and  richer   models  (adding  hundreds  of  variables)   -  Vestas  wind  library  at  2.5  PB:  to  grow  to  over   6  PB  in  the  near-­‐term   -  Granularity  27km  x  27km  grids:  driving  to  9x9,   3x3  to  10m  x  10m  simulaMons   § Reduced  turbine  placement  idenMficaMon  from   weeks  to  hours   § PerspecMve:  The  Vestas  Wind  library,  as  HD  TV   would  take  70  years  to  watch   13  
  • 14.
    14 Big Data Analyticsin Smarter Hospitals IBM Data Baby youtube.com Big  Data  enabled  doctors  from  University  of  Ontario  to  apply  neonatal  infant  monitoring  to   predict  infec8on  in  ICU  24  hours  in  advance     http://www.youtube.com/watch?v=0lt0hTNtjrY&feature=results_main&playnext=1&list=PL783389D2F81FFAB5
  • 15.
    IBM Watson isa breakthrough in analytic innovation, but it is only successful because of the quality of the information from which it is working. - 1 5
  • 16.
    - 1 6 Big Data andWatson InfoSphere BigInsights POS Data CRM Data Social Media Distilled Insight -  Spending habits -  Social relationships -  Buying trends Advanced search and analysis Watson can consume insights from
 Big Data for advanced analysis" Big Data technology is used to build Watson’s knowledge base" Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory." Approx. 200M pages of text (To compete on Jeopardy!) Watson’s Memory
  • 17.
    IBM is committedto Open Source ►  Decade of lineage and contributions to the open source community – Apache Hadoop and Jaql, Apache Derby, Apache Geronimo, Apache Jakarta, +++ – Eclipse: founded by IBM – Significant Lucene contributions via IBM Lucene Extension Library (ILEL) – DRDA, XQuery, SQL, XML4J, XERCES, HTTP, Java, Linux, +++ ►  IBM products built on open source – WebSphere: Apache – Rational: Eclipse and Apache – InfoSphere: Eclipse and Apache, +++ ►  IBM’s BigInsights (Hadoop) is 100% open source compatible with no forks
  • 18.
    Introducing MapReduce ►  In2003 and 2004 Google releases two papers that provide insight into their success – The Google File System – MapReduce: Simplified Data Processing on Large Clusters ►  Introduced an approach to large scale data processing known as MapReduce Global TLE Framework 1 8
  • 19.
    MapReduce ►  A programmingmodel – Inspired by functional programming – Allows expressing distributed computations on large amounts of data ►  Execution framework – Designed for large-scale data processing – Designed to run on clusters of commodity hardware Global TLE Framework 1 9
  • 20.
    MapReduce, the programmingmodel ►  Process key-value records ►  Map function: (Kin, Vin) è list(Kinter, Vinter) ►  Barrier between map and reduce phases – Shuffle and sort phase moves and groups like keys ►  Reduce function: (Kinter, list(Vinter)) è list(Kout, Vout) Global TLE Framework 2 0
  • 21.
    Map phase, word-countexample Global TLE Framework 2 1 (line1, “Hello there.”) (line2, “Why, hello.”) (“hello”,1)   (“there”,1)   (“why”,1)   (“hello”,1)  
  • 22.
    Sort phase, word-countexample Global TLE Framework 2 2 (“hello”, 1) (“hello”, 1) (“there”,  1)   (“why”,  1)  
  • 23.
    Reduce phase, word-countexample Global TLE Framework 2 3 (“hello”, 1) (“hello”, 1) (“there”,  1)   (“why”,  1)   (“hello”, 2) (“there”, 1) (“why”, 1)
  • 24.
    MapReduce, end toend Global TLE Framework 2 4
  • 25.
    Pseudocode for word-count GlobalTLE Framework 2 5 def  mapper(line):      foreach  word  in  line.split():          output(word,  1)     def  reducer(key,  values):      output(key,  sum(values)   Same code can be applied to thousands of lines, even the whole web! Google processes over 20PBs a day, much of it in MapReduce programs.
  • 26.
    But what aboutthe data! Global TLE Framework 2 6 Compute Nodes NAS SAN
  • 27.
    Distributed file systemenables processing to be moved to the data! Global TLE Framework 2 7 (key1, value1) (key2, value2) … (key1, value1) (key2, value2) … Processing is done local to the data Key-value pairs are processed independently and in parallel!
  • 28.
    Hadoop – AM/R Framework ►  Apache open source software framework for reliable, scalable, distributed computing of massive amount of data § Hides underlying system details and complexities from user § Developed in Java ►  Core sub projects: − MapReduce − Hadoop Distributed File System a.k.a. HDFS − Hadoop Common ►  Supported by several Hadoop-related projects § HBase § Zookeeper § Avro § Etc. ►  Meant for heterogeneous commodity hardware
  • 29.
  • 30.
  • 31.
    Hadoop Open SourceProjects ►  Hadoop is supplemented by an ecosystem of open source projects Jaql   Oozie  
  • 32.
    The IBM BigData Platform 32 InfoSphere BigInsights Hadoop-based low latency analytics for variety and volume Data-At-Rest Netezza High Capacity Appliance Queryable Archive for Structured Data Netezza 1000 BI+Ad Hoc Analytics on Structured Data Smart Analytics System Operational Analytics on Structured Data Informix Timeseries Time-structured analytics InfoSphere Warehouse Large volume structured data analytics InfoSphere Streams Low Latency Analytics for streaming data Velocity, Variety & Volume Data-In-Motion MPP  Data  Warehouse   Stream   CompuMng   InformaMon   IntegraMon   Hadoop   InfoSphere Information Server High volume data integration and transformation Apache Hadoop: open source framework for the distributed processing of large data sets across clusters of computers using a simple programming model
  • 33.
    The IBM BigData Platform 33 Integrate  and  manage   the  full  variety,   velocity  and  volume  of   data       Apply  advanced   analy7cs  to   informa7on  in  its   na7ve  form       Visualize  all  available   data  for  ad-­‐hoc   analysis   Development   environment  for   building  new  analy7c   applica7ons       Workload   op7miza7on  and   scheduling         Security  and   Governance  
  • 34.
    BigInsights Brings Hadoopto the Enterprise ►  BigInsights = analytical platform for persistent Big Data –  Based on open source & IBM technologies –  Managed like a start-up . . . . Emphasis on deep customer engagements, product plan flexibility ►  Distinguishing characteristics – Built-in analytics . . . . Enhances business knowledge – Enterprise software integration . . . . Complements and extends existing capabilities – Production-ready platform with tooling for analysts, developers, and administrators. . . . Speeds time-to-value; simplifies development and maintenance ►  IBM advantage – Combination of software, hardware, services and advanced research Hadoop System
  • 35.
    InfoSphere BigInsights
    Platform for volume, variety, velocity
    ►  Enhanced Hadoop foundation
    Analytics
    ►  Text analytics & tooling
    ►  Application accelerators
    Usability
    ►  Web console
    ►  Spreadsheet-style tool
    ►  Ready-made “apps”
    Enterprise Class
    ►  Storage, security, cluster management
    Integration
    ►  Connectivity to Netezza, DB2, JDBC databases, etc.
    Editions (chart axes: breadth of capabilities vs. enterprise class)
    ►  Apache Hadoop
    ►  Basic Edition: free download, integrated install, online InfoCenter, BigData Univ.
    ►  Enterprise Edition (licensed): application accelerators, pre-built applications, text analytics, spreadsheet-style tool, RDBMS and warehouse connectivity, administrative tools, security, Eclipse development tools, performance enhancements, . . .
  • 36.
    BigInsights Basic Edition
    ►  Connectivity and integration: JDBC, Flume
    ►  Infrastructure: Jaql, Hive, Pig, HBase, MapReduce, HDFS, ZooKeeper, Lucene, Oozie
    ►  Additional components shown: Integrated installer, Sqoop, HCatalog
    Legend: Open Source, IBM
  • 37.
    BigInsights Enterprise Edition
    ►  Connectivity and Integration: Streams, Netezza, JDBC, Flume
    ►  Infrastructure: Jaql, Hive, Pig, HBase, MapReduce, HDFS, ZooKeeper, Indexing, Lucene, Adaptive MapReduce, Oozie, Text compression, Enhanced security, Flexible scheduler
    ►  Analytics and discovery: Text processing engine and library, BigSheets, “Apps” (Web Crawler, Distrib file copy, DB export, Boardreader, DB import, Ad hoc query, Machine learning, Data processing, . . .)
    ►  Administrative and development tools
       –  Web console
          •  Monitor cluster health, jobs, etc.
          •  Add / remove nodes
          •  Start / stop services
          •  Inspect job status
          •  Inspect workflow status
          •  Deploy applications
          •  Launch apps / jobs
          •  Work with distrib file system
          •  Work with spreadsheet interface
          •  Support REST-based API
          •  . . .
       –  Eclipse tools
          •  Text analytics
          •  MapReduce programming
          •  Jaql, Hive, Pig development
          •  BigSheets plug-in development
          •  Oozie workflow generation
       –  Integrated installer
    ►  Additional components shown: DB2, R, IBM Cognos BI, GPFS (EAP), Accelerator for machine data analysis, Accelerator for social data analysis, Guardium, DataStage, Data Explorer, Sqoop, HCatalog
    Legend: Open Source, IBM, Optional IBM and partner offerings
  • 38.
    Open Source Components Across Distributions
    Component   BigInsights 2.0   HortonWorks HDP 1.2   MapR 2.0   Greenplum HD 1.2   Cloudera CDH3u5   Cloudera CDH4*
    Hadoop      1.0.3             1.1.2                 0.20.2     1.0.3              0.20.2            2.0.0*
    HBase       0.94.0            0.94.2                0.92.1     0.92.1             0.90.6            0.92.1
    Hive        0.9.0             0.10.0                0.9.0      0.8.1              0.7.1             0.8.1
    Pig         0.10.1            0.10.1                0.10.0     0.9.2              0.8.1             0.9.2
    Zookeeper   3.4.3             3.4.5                 X          3.3.5              3.3.5             3.4.3
    Oozie       3.2.0             3.2.0                 3.1.0      X                  2.3.2             3.1.3
    Avro        1.6.3             X                     X          X                  X                 X
    Flume       0.9.4             1.3.0                 1.2.0      X                  0.9.4             1.1.0
    Sqoop       1.4.1             1.4.2                 1.4.1      X                  1.3.0             1.4.1
    HCatalog    0.4.0             0.5.0                 0.4.0      X                  X                 X
    BigInsights continues to offer the most proven, stable versions of Apache Hadoop components
    *Cloudera CDH4 Hadoop 2.0 includes MapReduce 2.0, which Cloudera states is “not yet considered stable”
  • 39.
    Hadoop Systems
    Diagram components: HDFS; Map/Reduce; Hive, Pig & Jaql; Sqoop; Zookeeper; Avro (Serialization); HBase; ETL Tools; BI Reporting; RDBMS
  • 40.
    BigInsights Content
    Version   Basic Edition   Enterprise Edition   Function
    -         Inc             Inc                  Integrated install
    1.0.3     Inc             Inc                  Hadoop (including common utilities, HDFS, MapReduce framework)
    0.5.2     Inc             Inc                  Jaql (programming / query language)
    0.10.0    Inc             Inc                  Pig (programming / query language)
    0.9.4     Inc             Inc                  Flume (data collection / aggregation)
    0.9.0     Inc             Inc                  Hive (data summarization / querying)
    3.3.0     Inc             Inc                  Lucene (text search)*
    3.4.3     Inc             Inc                  Zookeeper (process coordination)
    1.6.3     Inc             Inc                  Avro (data serialization)
    0.94.0    Inc             Inc                  HBase (real time read / write)
    0.4.0     Inc             Inc                  HCatalog (table and storage management service)
    1.4.1     Inc             Inc                  Sqoop (RDBMS bulk data transfer)
    3.2.0     Inc             Inc                  Oozie (workflow / job orchestration)
    -         Inc             Inc                  Online documentation
    -         Inc             Inc                  Integration with JDBC sources through general-purpose Jaql module
    -         Inc             Inc                  Integration with DB2 (sample functions to submit jobs, read data)
  • 41.
    BigInsights Content (cont’d)
    Basic Edition   Enterprise Edition   Function
    n/a             Inc                  Integration with R (Jaql module to invoke R statistical capabilities from BigInsights)
    n/a             Inc                  Integration with Netezza, DB2 LUW with DPF from Jaql
    n/a             Inc                  LDAP authentication, Guardium support, etc.
    n/a             Inc                  Integrated Web Console
    n/a             Inc                  Business process accelerators (social data, machine data analytics)
    n/a             Inc                  Platform performance enhancements (Adaptive MapReduce, large scale indexing, efficient processing of compressed text files, flexible job scheduler, etc.)
    n/a             Inc                  Text analytics
    n/a             Inc                  Eclipse tools for text analytic development, Jaql, Hive, Java
    n/a             Inc                  Applications for data import/export, Web crawl, machine learning, etc.
    n/a             Inc                  Web-based application catalog
    n/a             Inc                  Spreadsheet-like analytical tool
    Opt             Inc                  IBM support
    n/a             Inc                  Streams, Data Explorer, Cognos BI (limited use licenses)
    n/a             Inc                  Unlimited storage
  • 42.
    BigInsights: Value Beyond Open Source
    Diagram layers: Open source components (IBM-certified Apache Hadoop or …); Enterprise Capabilities: Administration & Security, Workload Optimization, Connectors, Advanced Engines, Visualization & Exploration, Development Tools
    Key differentiators
    •  Built-in analytics
       –  Text engine, annotators, Eclipse tooling
       –  Interface to project R (statistical platform)
    •  Enterprise software integration
    •  Spreadsheet-style analysis
    •  Integrated installation of supported open source and other components
    •  Web Console for admin and application access
    •  Platform enrichment: additional security, performance features, . . .
    •  World-class support
    •  Full open source compatibility
    Business benefits
    •  Quicker time-to-value due to IBM technology and support
    •  Reduced operational risk
    •  Enhanced business knowledge with flexible analytical platform
    •  Leverages and complements existing software
  • 43.
  • 44.
    Big Data Application Ecosystem
    App Development (Eclipse)
    •  Code application program, and generate associated App
    •  Deploy Apps to Enterprise Manager
    Development tooling:
    •  Text analytics
    •  MapReduce
    •  Query languages
    •  . . .
    Data integration scenario: pre-defined work flows simplify loading data from various sources
    •  Work flows can be configured, deployed, executed and scheduled
    Application scenarios (web log, email, social media, …):
    •  Samples provide starting point, speed time to value
    Diagram labels: Eclipse; App library (MapReduce, …; Text Analytics; Query); App Development; Publish; Big Data Web Console
  • 45.
    Web Console
    •  Manage BigInsights
       –  Inspect / monitor system health
       –  Add / drop nodes
       –  Start / stop services
       –  Run / monitor jobs (applications)
       –  Explore / modify file system
       –  Create custom dashboards
       –  . . .
    •  Launch applications
       –  Spreadsheet-like analysis tool
       –  Pre-built applications (IBM supplied or user developed)
    •  Publish applications
    •  Monitor cluster, applications, data, etc.
  • 46.
    Running Applications from the Web Console
    •  Import & Export Data
    •  Database & Files
    •  Web and Social
    •  Analyze and Query
    •  Predictive Analytics
    •  Text Analytics
    •  SQL/Hive, Jaql, Pig, HBase
  • 47.
    Spreadsheet-style Analysis
    •  Web-based analysis and visualization
    •  Spreadsheet-like interface
       –  Define and manage long running data collection jobs
       –  Analyze content of the text on the pages that have been retrieved
  • 48.
    Get started with BigInsights
    •  In the Cloud
       –  Via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise Cloud, or on private clouds. Pay only for the resources used.
    •  In the Classroom
       –  Via IBM Education Online at www.bigdatauniversity.com
    •  On Your Cluster
       –  Download Basic Edition from ibm.com.
    •  With the BigInsights Community
       –  Technical portal @ http://tinyurl.com/biginsights
       –  BigData on DW @ http://ibm.co/bigdatadev
       –  Links to demos, papers, forum, downloads, etc.
    •  Stay connected with IBM Big Data
       –  http://ibmbigdatahub.com
  • 49.
    BigDataUniversity.com: Learn Big Data Technologies
    •  Flexible on-line delivery allows learning @your place and @your pace
       –  Free courses, free study materials.
       –  Cloud-based sandbox for exercises: zero setup.
       –  66666 registered students.
       –  Robust Course Management System and Content Distribution infrastructure.
  • 50.
    Big Data is ripe for innovation
  • 51.
  • 52.
    OSS in IBM Big Data Platform
    Hadoop     -  http://hadoop.apache.org/
    HDFS       -  http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
    Hive       -  http://hive.apache.org/
    HBase      -  http://hbase.apache.org/
    Flume      -  http://flume.apache.org/
    Jaql       -  http://code.google.com/p/jaql/wiki/Running
    Oozie      -  http://oozie.apache.org/
    Sqoop      -  http://sqoop.apache.org/
    Avro       -  http://avro.apache.org/
    Lucene     -  http://lucene.apache.org/
    Pig        -  http://pig.apache.org/
    Zookeeper  -  http://zookeeper.apache.org/
    Bigtop     -  http://bigtop.apache.org/
  • 53.
    Build a Big Data Program – MapReduce example
    Eclipse tools for Jaql, Hive, Pig, Java MapReduce, BigSheets plug-ins, text analytics, etc.
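    The slide's MapReduce example is not reproduced in this transcript. As a stand-in, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word count as the task. It is written in plain Python rather than against the Hadoop Java API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, not part of Hadoop or BigInsights. On a real cluster the framework distributes these phases across nodes; here they simply run in sequence.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values seen for one key into a total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence in a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data in motion"]
    print(word_count(sample))
```

    Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both steps can in principle run in parallel, which is the property the programming model relies on.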
  • 54.
  • 55.
    BigInsights and Text Analytics
    •  Distills structured info from unstructured text
       –  Sentiment analysis
       –  Consumer behavior
       –  Illegal or suspicious activities
       –  …
    •  Parses text and detects meaning with annotators
    •  Understands the context in which the text is analyzed
    •  Features pre-built extractors for names, addresses, phone numbers, etc.
    •  Built-in support for English, Spanish, French, German, Portuguese, Dutch, Japanese, Chinese
    Unstructured text (document, email, etc): "Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win."
    Output: Classification and Insight
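    To make the idea of an extractor concrete, the sketch below pulls US-style phone numbers out of free text and returns each hit with its character span. It uses a plain Python regular expression and does not use the BigInsights text analytics engine or its annotators; the `Annotation` type, the `PHONE` pattern, and `extract_phones` are assumptions made up for this illustration.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    """One extracted span: what kind of thing it is and where it was found."""
    kind: str
    text: str
    start: int
    end: int


# Hypothetical pattern: 3-3-4 digit groups, optional parentheses around the
# area code, with '-', '.', or whitespace as optional separators.
PHONE = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")


def extract_phones(text: str) -> List[Annotation]:
    """Return an Annotation for every phone-number-like span in the text."""
    return [
        Annotation("phone", match.group(), match.start(), match.end())
        for match in PHONE.finditer(text)
    ]


if __name__ == "__main__":
    for annotation in extract_phones("Call 303-555-0142 or (720) 555-0199 today."):
        print(annotation)
```

    A pattern like this covers only one surface form. Names and addresses generally need dictionaries and context rules in addition to patterns, which is the gap that pre-built extractors are meant to fill.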
  • 56.
    Example Analysis: Extraction from Twitter messages
    Extract intent, interests, life events and micro segmentation attributes
    •  Location: "I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR"
    •  Name, Birth Day: "@silliesylvia good!!! U shouldnt! Think about the important stuff, like ur birthday ;) btw happy birthday Sylvia ;)"
    •  Relocation: "@rakonturmiami im moving to miami in 3 months. i look foward to the new lifestyle"
    •  Monetizable Intent: "I had an iphone, but it's dead @JoaoVianaa. (I've no idea where it's) !Want a blackberry now !!!"
    While accounting for less relevant messages
    •  Subtle Spam: "I think that @justinbieber deserves his 2 AMAZING songs in top ten!!! Buy them on itunes"
    •  Advertising: "http://Cell-Pones.com Looking to buy a phone? WiFi Cell Phones, Windows Mobile"
    •  Sarcasm: "@purplepleather Gotta do more research my Versace term paper 2day. Before I die, I want a versace purple diamond tiara. Im just sayin>lol"
    •  Wishful Thinking: "had so much fun today! I want to buy a million dollar house with a wrap around porch ... ... wading river on the long island sound, ha i wish!"
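    A naive keyword tagger shows why the slide separates genuine signals from less relevant messages. The sketch below is a toy in plain Python: the label names come from the slide, but the `RULES` patterns and the `tag` function are hypothetical and are not how the BigInsights text analytics engine works.

```python
import re
from typing import List

# Hypothetical keyword rules. The labels appear on the slide; the patterns
# are assumptions chosen only to illustrate the approach.
RULES = {
    "Relocation": re.compile(r"\bmoving to\b", re.IGNORECASE),
    "Birth Day": re.compile(r"\bhappy birthday\b", re.IGNORECASE),
    "Monetizable Intent": re.compile(r"\bwant a\b", re.IGNORECASE),
}


def tag(message: str) -> List[str]:
    """Return every label whose keyword pattern occurs in the message."""
    return [label for label, pattern in RULES.items() if pattern.search(message)]


if __name__ == "__main__":
    tweets = [
        "im moving to miami in 3 months. i look foward to the new lifestyle",
        "I had an iphone, but it's dead. !Want a blackberry now !!!",
        "Before I die, I want a versace purple diamond tiara. Im just sayin>lol",
    ]
    for tweet in tweets:
        print(tag(tweet), "<-", tweet)
```

    The third tweet is tagged as Monetizable Intent even though the slide files it under less relevant messages. Keyword matching alone cannot tell a purchase intent from sarcasm or wishful thinking, so some notion of context is needed on top of the patterns.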