Data Analytics in the Cloud presentation at SOA World, part of the SOA & Cloud Computing track, focus on open source software, SOA, data analytics, Apache Hadoop
Data Analytics In The Cloud Soa World
1. Open Source SOA in the Cloud: Data Analytics in the Cloud
Tom Plunkett TomPlunkett@vt.edu
Michael Sick michael.sick@serenesoftware.com
SOA World 2009
2. Overview
Data Analytics in the Cloud
Introductions
• Who are we?
• Baselines & definitions
Opportunity
• Targeted Use Cases
• Technical convergence & opportunities
• Commercial opportunities & drivers
Technology & Standards
• State of current technology
• Commercial & FOSS solutions
• Hadoop Focus
Challenges
• Challenges to Meet Target Use Cases
• Economic challenges & the role of “free”
• Wide scale challenges in Cloud and data analytics
Questions
• Questions
• Contacts
This work is licensed under a Creative Commons Attribution 3.0 United States
License. Tom Plunkett & Michael Sick
3. Data Analytics in the Cloud: Introductions
4. Tom Plunkett
Extensive Federal Government Experience
IBM Certified SOA Solution Designer
Patents
Teach OOP and Java for Virginia Tech
5. Michael Sick
Commercial & Federal Enterprise Architect
Owner: Serene Software Inc. – EA Services Firm
Clients include: BAE, USAF, Raytheon, BearingPoint,
McGraw-Hill, Sun Microsystems, Badcock Furniture
Fascinated by technology, 15 years running
6. Serene Software
• Serene is a boutique consulting company focusing on
delivery of Enterprise Architecture services and solutions
• Service Areas
– IT Governance
– IT Strategy
– IT Cost Containment
– Service Oriented Architectures (SOA)
– IT Solution Selection
– IT Audit & Analysis
• Experience includes: BAE, USAF, Raytheon, BearingPoint,
McGraw-Hill, Sun Microsystems, Badcock Furniture, …
• Founded in 2003 (privately held, no debt) and
headquartered in Jacksonville, FL
7. Draft NIST Definition of Cloud Computing
A model for enabling convenient, on-demand network access to a shared pool
of configurable computing resources that can be rapidly provisioned and
released with minimal management effort or service provider interaction
Essential Characteristics
• On-demand self-service
• Ubiquitous network access
• Location independent resource pooling
• Rapid elasticity
• Measured Service
Delivery Models
• Cloud Software as a Service (SaaS)
• Cloud Platform as a Service (PaaS)
• Cloud Infrastructure as a Service (IaaS)
Deployment Models
• Private cloud
• Community cloud
• Public cloud
• Hybrid cloud
Source: Draft NIST Definition of Cloud Computing, 06/2009
8. OSI Open Source Definition
Free Redistribution
Source Code
Derived Works
Integrity of The Author's Source Code
No Discrimination Against Persons or Groups
No Discrimination Against Fields of Endeavor
Distribution of License
License Must Not Be Specific to a Product
License Must Not Restrict Other Software
License Must Be Technology-Neutral
Source: http://www.opensource.org/docs/osd
9. The Open Group SOA Definition
Service-Oriented Architecture (SOA) is an architectural
style that supports service orientation
Service orientation is a way of thinking in terms of services
and service-based development and the outcomes of services
Source: http://www.opengroup.org/projects/soa/doc.tpl?gdid=10632
10. Data Clouds & Data Grids – What's the difference?
The terms Data Clouds and Data Grids are often used interchangeably; we
make the following distinctions
Data Grids
• Grid computing system optimized to share large amounts of distributed data
• Focus on technical capabilities
• Often combined with computational grid computing systems
• Data often moved to compute grid for use
• Often oriented towards highly structured scientific data computing
applications
Data Clouds
• Focuses on perception of infinite storage, computing capacity
• Focus on cost, virtualization & flexible capacity
• Enables scale-up/scale-down economics
• Data moved rarely, locality is a key feature
• Clouds thus far focusing on column oriented, massively scalable data stores
Sources: Wikipedia & [Grossman 1]
11. Definition: Mashups
Web available resource that combines data/functions
from two or more external resources
Idea of mashup efforts is to reduce the cost of
producing and consuming resources
Integration should be fast, easy
Often focuses on widely available formats/protocols
like RSS or Atom over HTTP
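As a rough illustration of the definition above, the following sketch
combines items from two RSS 2.0 documents into one list. The feed contents,
the example.com links, and the function names parse_items and mashup are all
hypothetical; a real mashup would fetch the feeds over HTTP instead of
using inline strings.

```python
import xml.etree.ElementTree as ET

# Two tiny, made-up RSS 2.0 documents standing in for external resources.
FEED_A = """<rss version="2.0"><channel><title>Feed A</title>
<item><title>Cloud pricing update</title><link>http://example.com/a1</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Feed B</title>
<item><title>Hadoop release notes</title><link>http://example.com/b1</link></item>
<item><title>Analytics roundup</title><link>http://example.com/b2</link></item>
</channel></rss>"""


def parse_items(rss_text):
    """Extract (source, title, link) tuples from one RSS 2.0 document."""
    channel = ET.fromstring(rss_text).find("channel")
    source = channel.findtext("title")
    return [
        (source, item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]


def mashup(*feeds):
    """Combine the items of several feeds into one list sorted by title."""
    items = [entry for feed in feeds for entry in parse_items(feed)]
    return sorted(items, key=lambda entry: entry[1])


combined = mashup(FEED_A, FEED_B)
for source, title, link in combined:
    print(f"{title} ({source}): {link}")
```

Because both inputs use the same widely available format, the integration
code stays short, which is the cost argument the slide makes.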
12. Data Analytics in the Cloud: Opportunities
13. Use Case: Cloud Data Analytical Tools for Intelligence Community Field Analyst
Problem Statement: Analytical tools are obsolete on deployment, while
field analysts need timely, configurable data analytics. How does
cloud-based DA meet the needs of IC analysts?
Customer Problem
• Traditional business intelligence tools require years to develop
• Field Analysts confront situations which are rapidly changing
• Petabytes of data require analysis
Cloud Analytical Tools Solution
• Recomposable Cloud Computing Data Analytical Tools
– Apache Hadoop
– Mashups
– Service-Oriented Architecture
Customer Value
• Enabling field analysts to quickly build the analytical tool they need to
analyze petabytes of data
14. Why the “Buzzword” Soup? Convergence of Capabilities
Convergence of capabilities: Free Open Source Software (FOSS),
Virtualization, SaaS, Cloud Computing, Data Analytics, Mashups
New opportunities in breadth and depth of DA services
• Big Data: Cloud disk and data storage engines make petabyte
environments available to new clients
• Value Based Billing: Heavy use of FOSS in the cloud reduces costs
directly & indirectly
• Capacity Scaling: Scaling up/down of capacity in pay-go fashion makes DA
available to wider audience
• Composable UIs: Capability to assemble DA results into various
interfaces
15. Early Data Analytic Cloud Consumers/Providers
Cloud DA Opportunities (Profile Types: Example Companies)
Internet Scale Service Providers
• Big Internet Companies: Yahoo, Amazon – can build DA on inf. Services
• SaaS Companies: Force.com – DA & Warehousing to SBA’s
• Social Platforms: Facebook – sell DA access to anon. user info
Large data-centric Traditional Co’s
• Insurers: BCBS – private clouds across consortium Services
• Healthcare & Biotech: Kaiser Permanente – common DA services
• Rating Agencies: S & P – open DA cloud to customers
Government Organizations
• Intelligence Community Services: CIA – private org-wide Cloud
• Defense Managed Services: DISA – offer DA to .mil clients
• Healthcare Services: SSA – offer DA to fraud prevention analysts
DAaaS Providers
• DAaaS Infrastructure: Cloudera – managed Hadoop instances
• SMB DAaaS Provider: ?? – managed DAaaS, simplified, low cost
16. Data Analytics in the Cloud: Technology & Standards
17. Google MapReduce
Algorithm for solving large problems in a distributed way, using a
divide and conquer approach with a cluster of nodes
The master node maps the input into smaller sub-problems and distributes
the work to the cluster. A worker node may in turn map its work onto a
further cluster of nodes. The worker nodes then process the smaller
problems and return the answers to the master node.
The master node then reduces the set of answers into the answer to the
original problem.
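The flow described above can be sketched in a single process, with word
counting as the example problem. The function names and the explicit
shuffle step (grouping intermediate values by key) are assumptions added
for illustration; the sketch does not distribute any work across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: turn one input chunk into a list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key before they are reduced."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single answer."""
    return key, sum(values)


def map_reduce(documents):
    # Each document stands in for a sub-problem handed to a worker node.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = map_reduce(["the cloud stores data", "the grid moves data"])
print(counts)
```

Keeping map_phase and reduce_phase free of shared state is what would let
a framework run them on different nodes.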
18. Apache Hadoop
Open Source implementation of the MapReduce algorithms
Hadoop can store and process petabytes of data
Subprojects include HBase, Chukwa, Hive, Pig, and ZooKeeper
Yahoo (more than 100,000 CPUs in >25,000 computers
running Hadoop) and other companies make extensive use of Hadoop
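One way to try the MapReduce model on Hadoop without writing Java is Hadoop
Streaming, which runs mapper and reducer programs that exchange
tab-separated key/value lines over standard input and output. The slide
does not mention Streaming; the sketch below is an assumption about how
such a job could look, written as plain functions so it can be run
locally. The names mapper, reducer, and run_local are hypothetical.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per key; the input must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    # sorted() stands in for the sort the framework performs between phases.
    return list(reducer(sorted(mapper(lines))))


result = run_local(["Hadoop stores data", "Hadoop processes data"])
print("\n".join(result))
```

In an actual Streaming job the two functions would live in separate
scripts that read sys.stdin and print their records.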
19. As-Is Hadoop Simplified Reference Architecture
[Diagram components: Apache Hadoop, Chukwa, HBase, Zookeeper, Pig, Hive,
Structured Data, Unstructured Data, ETL, Business Intelligence]
20. Apache Hadoop Sub-projects
Hadoop Sub-projects: Capabilities and Example Companies
Chukwa (example company: Yahoo)
• Data collection system for monitoring and analyzing large distributed
systems
HBase (example company: Yahoo)
• Similar to Google’s BigTable
• Distributed database for structured data
• Multi-dimensional sorted map
Hive (example company: Facebook)
• Data warehouse infrastructure for large datasets
• Hive QL query language
Pig (example company: Yahoo)
• High-level language for data analysis
• Compiler for Map-Reduce programs
Zookeeper (example company: Yahoo)
• Configuration, Naming, Distributed Synchronization, and group services
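To make the "multi-dimensional sorted map" description of HBase concrete,
here is a toy in-memory sketch that maps (row, column, timestamp) to a
value and keeps keys sorted so that row ranges can be scanned in order. It
is not the HBase API; the class SortedCellMap and its methods are
hypothetical names chosen for illustration.

```python
import bisect


class SortedCellMap:
    """Toy map from (row, column, timestamp) to value, kept sorted by key."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # same key -> stored value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so newer versions sort first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value for a cell, or None if it is absent."""
        probe = (row, column, float("-inf"))
        i = bisect.bisect_left(self._keys, probe)
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, timestamp, value) for start_row <= row < stop_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


table = SortedCellMap()
table.put("row2", "cf:a", 1, "x")
table.put("row1", "cf:a", 1, "old")
table.put("row1", "cf:a", 2, "new")
print(table.get("row1", "cf:a"))
print(list(table.scan("row1", "row2")))
```

Sorting by row key is what makes range scans cheap; a distributed store
would additionally split the sorted key space across servers.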
21. Data Analytics in the Cloud: Challenges
22. To-Be Simplified Hadoop Architecture
[Diagram components: Apache Hadoop, HBase, Pig, Hive, Chukwa, Zookeeper,
REST API, SOAP API, Query Language, Algorithm Library, ETL, Business
Intelligence, Structured Data, Unstructured Data]
23. Key Challenges
Emerging Challenges
Infrastructure
• Hardware: Speed of Rack Interconnects, Multi-core
• Parallelization: Core platform, Data Analytic Components
• Node Affinity: Make use of super nodes, XML i/o, en/de-crypt
Adoption
• Cost: “brutally efficient” pricing, FOSS advantages
• Cost Models: Accurate, open models of CapEx, OpEx costs
• Migration Pain: Full warehouse migration, ETL
Administration
• Ease of Admin.: Parallel current RDBMS, Warehouse admin
• Debugging: Distributed debugging, integration w/ Provider
• Flexible Provisioning: Multi-level provisioning – co., dept, individual
• System Reporting: Reporting, audit trails, view to DA system
Input & Analysis
• ETL Integration: Interface, metadata optimized for ETL loading
• Intuitive APIs: Declarative & programmatic cross language
• Product Integration: BI, Applications (SAP, Oracle Financial, Lawson)
Output
• Data Visualization: Viewing & drill down of very large data sets
• Intuitive APIs: Declarative & programmatic cross language
• Mashups/Dynamics: Easy discovery of data & functions & workflows
24. Solutions: Projected & In-Progress
Emerging Challenges
Infrastructure
• Hardware: Interconnect $$ dropping, hardware maturing
• Parallelization: Platforms advance, market for components
• Node Affinity: Discovery of capability, affinity into Hadoop, …
Adoption
• Cost: FOSS’s game to lose, small diff * a lot = a lot
• Cost Models: Industry standard ROI/IRR models for CC
• Migration Pain: Migration toolkits for traditional DW products
Administration
• Ease of Admin.: Integrated & extended admin packages
• Debugging: Commercial distributed debugging
• Flexible Provisioning: Multi-level provisioning – co., dept, individual
• System Reporting: Reporting, audit trails, view to DA system
Input & Analysis
• ETL Integration: ETL interface, support of popular packages
• Intuitive APIs: SQL like interface in core, language bindings
• Product Integration: 3rd party adaptors, IWay et al
Output
• Data Visualization: Modeling, meta-data, traceability, and new UIs
• Intuitive APIs: SQL like interface in core, language bindings
• Mashups/Dynamics: Generic datatypes, discovery services
25. Data Analytics in the Cloud: Questions
26. Questions? & Contact Information
Michael A. Sick
Principal Architect / Partner
888.777.1847
michael.sick@serenesoftware.com
Address: Serene Software
116 19th Ave. North, Suite 503
Jacksonville Beach, FL
URL: www.serenesoftware.com
Tom Plunkett
Cloud Computing Architect
888.777.1847
TomPlunkett@vt.edu
Address: Serene Software
116 19th Ave. North, Suite 503
Jacksonville Beach, FL
URL: www.serenesoftware.com