Accelerating Data Science and Real-Time Analytics at Scale
Nadeem Asghar, Hortonworks, Field CTO and Global Head Partner Engineering
Steve Roberts, IBM, Big Data Offering Manager
[Figure: Available Data and Understood Data plotted over Time, with the label "Enterprise Amnesia".]
• 80 million wearable health devices will be available by 2017.
• 2.5 quintillion bytes of data generated daily by connected machines.
• There will be 28 times more sensor-enabled devices than people by the year 2020.
• 25 gigabytes of data per hour is generated by a connected car. 90% of cars will be connected by 2020.
• 153 exabytes of healthcare data generated by devices in 2013, increasing to 2,314 exabytes in 2020.
• 1.7 megabytes of data per second generated by every human being on the planet by 2020.
A New Era of Computing Has Emerged
[Figure labels. Computing eras: Centralized Mainframes, Personal Computer, Office Productivity, Client/Server, Distributed Computing, E-Business, Smarter Planet, Cognitive Era. Data technologies: Transactional Database, Data Warehousing, Business Intelligence, Reporting, Big Data & Analytics, Big Data & Predictive Analytics, Cloud, Cognitive. Outcomes: Data, Insight, Context, Actionable Insight in context.]
© 2018 IBM Corporation
From Data to AI: Intelligent Job Matching
A recruiting and HR company chose an IBM & Hortonworks full-stack solution to support its Hadoop/Spark workloads and accelerate its analytics and AI projects.
Business problem
Job matching is the company's core business, and the accuracy and speed of this matching are critical to its success. This requires the intake and analysis of terabytes of data daily, including recruiter and company information, job listings, hiring histories, and resumes. A future requirement is to apply AI to more complex data such as images, sound and video.
Benefits
• Proven performance
• World-class support
• Reliable security for personal data
• Built on open technologies, avoiding vendor lock-in
• Scalable software-defined storage proven for analytics
• POWER9 and PowerAI support their AI research and development
AI at the Edge
[Figure labels, in extraction order: accident risk rate, 90%, inspection times, 10X, number of inspections]
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
IBM + Hortonworks = Unlocking Actionable Insights
→ #1 Pure Open Source Hadoop Distribution
→ 1000+ customers and 2100+ ecosystem partners
→ Employs the original architects, developers and operators of Hadoop from Yahoo!
→ Best-in-class 24x7 customer support
→ Leading professional services and training
→ Data Science Leader
→ OpenPOWER performance leadership
→ Flexible, software-defined storage
→ #1 SQL Engine for complex, analytical workloads
→ Leader in On-premise and Hybrid Cloud solutions
DATA – More Volume and More Types
[Figure: increasing data variety and complexity as volume grows from gigabytes through terabytes and petabytes to exabytes, across ERP, CRM, Web and Big Data sources. Data types shown: user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video, speech to text, product/service logs, social network, business data feeds, user click stream, web logs, offer history, dynamic pricing, A/B testing, affiliate networks, search marketing, behavioral targeting, dynamic funnels, payment record, support contacts, customer touches, purchase detail, purchase record, segmentation, offer details.]
Business Analytics Must Evolve To Deal With Data Tipping Point
• Provide insight into the past via data aggregation, data mining, business reporting, OLAP, visualization, dashboards, etc.
• Understand the future via statistical models, forecasting techniques, machine learning, etc.
• Advise on possible outcomes via rules, optimization and simulation algorithms
Data Science and Real-Time Analytics at Scale
End to End Data Science Workflow
• Data Engineering: discover, acquisition, processing, curation
• Data Science: data wrangling; feature engineering, visualization and analysis; model building, training and testing
• Deployment & Operationalize: reports, dashboards, real-time scoring, batch scoring, REST services, performance management, scheduling
• Data Science Experience (DSX). Enterprise Services: Multi Notebook Support, Versioning, Collaboration, Model Management
• Hortonworks Data Platform (HDP). Enterprise Services: Data, GPU, Deep Learning, Compute, Security, Governance, Metadata, Operations
• Hortonworks Data Flow (HDF). Enterprise Services: Data Ingestion, Schema Registry, CEP
Use Case Deep Dive: Credit Card Fraud Prevention
Building a Model
→ Show of hands: how many have built a “model”?
→ What are some limitations?
  – Conditional logic: if/else binary decisions
→ If you need a lot of data to build a good model, what tools can you use?
  – Data volumes can rule out desktop tools
→ Sampling?
  – Each sample needs an even distribution of true and false positives, but that requires data munging, which brings back the question of which tools can be used.
→ Security concerns?
  – Extracting data from its secure resting place and pushing it into other environments, often unsecured files or desktops where Matlab or R can be installed.
→ Collaboration
  – Push processing to the data using modern distributed tooling.
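The sampling concern above can be illustrated with a small sketch. This is not the presenters' code: `balanced_sample`, the `is_fraud` field and the toy data are hypothetical, and it assumes labeled rows small enough to fit in memory (at real volumes the same idea would run on the cluster). It draws the same number of rows from each class so that the rare class is not lost.

```python
import random
from collections import defaultdict

def balanced_sample(rows, label_key, per_class, seed=0):
    """Draw the same number of rows from every class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    sample = []
    for label, members in by_class.items():
        if len(members) < per_class:
            raise ValueError(f"class {label!r} has only {len(members)} rows")
        # Sample without replacement inside each class.
        sample.extend(rng.sample(members, per_class))
    rng.shuffle(sample)
    return sample

# Toy data (made up): 1,000 transactions, 2% of them labeled as fraud.
transactions = [{"id": i, "is_fraud": i % 50 == 0} for i in range(1000)]
sample = balanced_sample(transactions, "is_fraud", per_class=10)
fraud_count = sum(1 for t in sample if t["is_fraud"])
```

A plain random sample of 20 rows from this toy set would usually contain no fraud at all, which is the reason for sampling per class.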
Credit Card Fraud Use Case
→ Requirement: Detect fraudulent transactions.
→ Goal: Save the card company money, build trust among card users, and cut down on fraud.
→ Functional requirement: Detect fraud in under 2 seconds at point of sale. Learn, adapt and make smarter decisions over time.
→ Design
  – Distance: How far can a card holder travel in a given period before a purchase looks fraudulent?
  – Category: How can we detect a purchase that a customer would be unlikely to make?
  – Frequency: How can we detect purchasing patterns that do not resemble the card holder's?
→ Ideas?
  – Whiteboard some conditional logic: egregiousness vs. binary
  – Backtest against the data
  – Build a model per card holder?
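The distance question can be sketched as an implied-speed check between two consecutive purchases on the same card. The field names (`ts`, `lat`, `lon`) and the 900 km/h threshold are assumptions made for illustration, not values from the demo; the great-circle distance uses the standard haversine formula.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # assumption: roughly airliner cruising speed

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def implied_speed_kmh(prev, curr):
    """Speed the card holder would need to make both purchases in person."""
    dist = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    hours = (curr["ts"] - prev["ts"]) / 3600.0
    if hours <= 0:
        return math.inf if dist > 0 else 0.0
    return dist / hours

# Hypothetical purchases one hour apart in New York and Los Angeles.
new_york = {"ts": 0, "lat": 40.7128, "lon": -74.0060}
los_angeles = {"ts": 3600, "lat": 34.0522, "lon": -118.2437}
speed = implied_speed_kmh(new_york, los_angeles)
suspicious = speed > MAX_PLAUSIBLE_KMH
```

Returning a speed rather than a yes/no answer keeps the "egregiousness" of the violation available to downstream logic.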
Rules, Statistics, Machine Learning
→ Rule-based logic
  – Great for checking conditions that can be evaluated with 100% accuracy. Easy to build, with no reason to over-engineer.
  – Example: spending limit. Card holder limit = $2,000
    • If (currentPurchaseAmount + balance > 2,000) then deny transaction
→ Statistics
  – Mean, median, mode, variance, deviation
  – Anomaly detection. Outliers (e.g. women's retail example)
→ Machine learning
  – Supervised
  – Unsupervised
  – Trainable
  – Adapts over time
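The first two approaches can be contrasted in a few lines. The rule mirrors the spending-limit example from the slide; the z-score outlier check, the threshold of 3 and the purchase history are illustrative assumptions rather than the demo's actual logic.

```python
import math
from statistics import mean, stdev

CARD_LIMIT = 2000.00  # card holder limit from the slide's example

def exceeds_limit(current_purchase_amount, balance, limit=CARD_LIMIT):
    """Rule-based check: deny when the purchase would push the balance past the limit."""
    return current_purchase_amount + balance > limit

def z_score(value, history):
    """Statistical check: how many sample standard deviations value lies from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0 if value == mu else math.inf
    return (value - mu) / sigma

Z_THRESHOLD = 3.0  # assumption: a common rule-of-thumb cutoff
history = [42.0, 45.0, 47.5, 44.0, 46.5, 43.0]  # made-up purchase amounts

deny = exceeds_limit(150.00, 1900.00)
is_outlier = abs(z_score(400.0, history)) > Z_THRESHOLD
is_typical = abs(z_score(45.0, history)) <= Z_THRESHOLD
```

The rule is exact and needs no data; the statistical check depends on how representative the history is, which is where trainable models come in.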
Discovery
→ Gathered all credit card transactions
  – The problem: taken together, they didn't make sense
  – No identifiable patterns, no log-normal curves
  – Gas $45, Chipotle $8.50, steak dinner $88, Amazon shoes $55
→ Classification
Outlier Detection: identify abnormal patterns
Example: identify anomalies
Features:
- Time frequency
- Category
- Amount
- Distance
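As a rough illustration of how the four features might be derived per transaction, the sketch below builds a feature record from a card's earlier purchases. The schema (`ts`, `category`, `amount`, `km_from_last`) and the helper name are hypothetical; the history amounts reuse the example purchases quoted in the deck, and the distance is assumed to be computed upstream.

```python
from statistics import mean

def build_features(history, txn):
    """Derive outlier-detection features for txn from the same card's earlier purchases.

    history is ordered oldest first and must not be empty (hypothetical schema).
    """
    last = history[-1]
    return {
        # Time frequency: seconds since the previous purchase on this card.
        "seconds_since_last": txn["ts"] - last["ts"],
        # Category: has this card ever bought in this category before?
        "new_category": txn["category"] not in {h["category"] for h in history},
        # Amount: size relative to the card's average purchase.
        "amount_ratio": txn["amount"] / mean(h["amount"] for h in history),
        # Distance: kilometres from the previous purchase, assumed precomputed.
        "km_from_last": txn["km_from_last"],
    }

history = [
    {"ts": 0, "category": "gas", "amount": 45.00},
    {"ts": 4000, "category": "restaurant", "amount": 8.50},
    {"ts": 90000, "category": "restaurant", "amount": 88.00},
    {"ts": 180000, "category": "apparel", "amount": 55.00},
]
txn = {"ts": 180120, "category": "electronics", "amount": 982.50, "km_from_last": 3900.0}
features = build_features(history, txn)
```

Each value on its own is weak evidence; an outlier detector would score the combination.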
Fraud Detection Demo Technical Architecture
• Data in motion: real-time data movement (Apache NiFi), inbound messaging (Kafka), real-time processing (Storm)
• Distributed storage: HDFS
• Many workloads: YARN
• Real-time serving (HBase)
• Machine learning (Spark)
• UI and HTTP PubSub (Jetty and Tomcat)
• Data science (DSX)
• Resource allocation (Docker)
• Interactive query (Hive)
• Authorization (Ranger)
• Governance (Atlas)
• All running on top of IBM Power hardware
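To make the data-in-motion path concrete, here is a single-process stand-in for the flow from ingest to messaging to scoring to a serving store. It uses only the Python standard library and none of the real components: `queue.Queue` stands in for the message topic, a dict for the serving store, and `toy_score` for a trained model. All names are hypothetical; the 2-second budget comes from the deck's stated point-of-sale requirement.

```python
import queue
import time

LATENCY_BUDGET_S = 2.0  # the deck's requirement: detect fraud in under 2 seconds

def ingest(raw_events, topic):
    """Stand-in for the data-movement tier: publish events to a message queue."""
    for event in raw_events:
        topic.put(event)

def process(topic, score, serving_store):
    """Stand-in for stream processing: score each event and store the result."""
    while True:
        try:
            event = topic.get_nowait()
        except queue.Empty:
            break
        started = time.perf_counter()
        flagged = score(event)
        serving_store[event["id"]] = {
            "flagged": flagged,
            "latency_s": time.perf_counter() - started,
        }

def toy_score(event):
    """Placeholder for a trained model: flag unusually large amounts."""
    return event["amount"] > 1000

topic = queue.Queue()
serving_store = {}
ingest([{"id": 1, "amount": 45.0}, {"id": 2, "amount": 2500.0}], topic)
process(topic, toy_score, serving_store)
within_budget = all(r["latency_s"] < LATENCY_BUDGET_S for r in serving_store.values())
```

In the real architecture each stage is a separate distributed service, so the latency budget also has to cover network hops and queueing, which this sketch does not model.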
Use Case Demo: Credit Card Fraud Prevention
Credit Fraud Analyst Inbox
Credit Fraud Analyst Investigation
Credit Fraud Analyst Action
Hortonworks Data Flow: Backbone for Bi-Directional Communication
Demo Summary
Problems Solved
• Data science teams can collaborate and learn new tools on a common framework.
• Choice of open source tools, notebooks, and languages.
• Run a favorite notebook on all data in the HDP cluster.
• Deploy the model to production.
• Leverage the production model to deliver insights to the business.
• Monitor the health and performance of models in production.
Journey to Fraud Detection
[Figure labels: Renovate, Innovate; Years of Customer Transaction Data, Real-time ingest of transactions, Complete Customer Profile, Purchase Behavior Insight, Fraud Detection, Immediate Customer Feedback, Improved Experience / Reduced Cost.]
Proactively identify potentially fraudulent transactions to protect the customer and improve customer experience
• Proactively monitor every credit card transaction using machine learning to catch potential fraud
• Customer Service Analyst reviews flagged transactions in real time via a next-generation application running on the connected platform
• HDF controls the real-time flow of data in and out of the connected platform to the various source and destination points
Data Science Solution
Community
• Find tutorials and datasets
• Connect with Data Scientists
• Ask questions
• Read articles and papers
• Fork and share projects
Open Source
• Code in Scala/Python/R/SQL
• Zeppelin & Jupyter Notebooks
• RStudio IDE and Shiny
• Apache Spark
• Your favorite libraries
Scale & Enterprise Security
• Data Science at Scale
• Run Spark Jobs on HDP Cluster
• Secure Hadoop Support
• Ranger Atlas Support for Data
• Support for ABAC
Model Management
• Data Shaping Pipeline UI
• Auto-data preparation & modeling
• Advanced Visualizations
• Model management & deployment
• Documented Model APIs
Data Science Experience
• Freedom: Choose the right tool for your team and business.
• Productivity: Make both experienced and novice data scientists more productive.
• Trust: Confidently deploy insights generated from the most current data and trends.
• Enterprise-ready software distribution built on open source
• Tools for ease of development
• Performance: faster training times for data scientists
IBM Power Systems: designed to deliver breakthrough performance for data
• More threads per core vs. x86: get more work done
• More processor cache (L1 ↔ L4): fastest memory lives on cores
• More memory bandwidth: more data than ever is flowing
• Better open innovation community: faster innovation and value
• Availability | scalability | reliability | serviceability
Accelerate Data Science with Power Systems
• Increase Data Science Team productivity
• Reduce model training time
  − 2.5X with S822LC for HPC vs E5-2640 v4 (with SSD)
  − 1.5X with S822LC for Big Data vs E5-2699 v4 (with HDD)
• Leverage larger datasets for model training
  − 2.5X larger dataset in the same time (1,200 seconds: ~5 GB for x86 server E5-2640 with SSD vs 13 GB for Power server S822LC HPC with SSD)
[Figure: elapsed time (seconds, 0 to 3,600) to form 5 clusters in 100 iterations using k-means clustering with one user, plotted against data size (GB, 0 to 16). Series: S822LC HPC with SSD, S822LC Big Data with HDD, E5-2699 v4 with HDD, E5-2640 v4 with SSD.]
Test results are based on running a machine learning workload based on the k-means clustering algorithm on data sets ranging in size from 1 GB to 15 GB. Test system details: Power Systems S822LC HPC (20 cores, 512 GB RAM, SSD); Power Systems S822LC Big Data (20 cores, 512 GB, HDDs); Intel server with Broadwell E5-2640 v4 (20 cores, 512 GB, SSD); Intel server with Broadwell E5-2699 v4 (44 cores, 512 GB, HDD).
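For readers unfamiliar with the benchmarked workload, the following is a minimal pure-Python sketch of k-means (Lloyd's algorithm) with a fixed iteration count, matching the shape of the test ("k clusters in N iterations"). It is not the benchmark code, which the deck does not show; the function name and toy points are made up, and a real run at gigabyte scale would use a distributed implementation.

```python
import random

def kmeans(points, k, iterations, seed=0):
    """Lloyd's algorithm on tuples of floats; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct input points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Two well-separated toy blobs (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, assignments = kmeans(points, k=2, iterations=10)
```

Every iteration touches every point once per centroid, so elapsed time grows with data size, which is what the chart measures.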
The Perfect Blend of Data Science and an Enterprise Data Lake
Better Together
datascience.ibm.com
For clients building a high-performing Data Science practice with a fast, scalable, enterprise Data Lake: a complete solution of Data Science and Hadoop software, hardware and quick start services.
• Boost Data Science Team Productivity: model training in less than half the time versus x86
• Blazing Fast Insights for Line of Business: a 1.7x improvement in time to result
• Secure and Reliable Data Access at Scale: open, comprehensive data lifecycle and security management on the most reliable servers.
© 2016 IBM Corporation
Hortonworks Preconfigured Images available on IBM POWER8
1. Go to IBM Power Development Cloud (PDC): Link
2. Follow the Get Started process via the “Go to Program to Get Started” link and register for IBM PDC as a Partner or Open Source Developer.
3. When you reach the IBM PDC “Make a Reservation” page, click Request promo code.
4. Select Red Hat Linux for the Image Category. Enter the vCPUs and memory using values from the size/flavor options in the table below. In the Other requirements field, enter one of the Image names from the table below. Click Submit.
5. Wait for an approval email. Then, follow the instructions in the Create Reservation guide to complete your reservation.
6. On the reservations page, select the company profile that shows VMaaS, enter the Promo code received in the email, and click Apply.
7. In the next form, select the desired Flavor and Image name.

Image Name | Software Versions | Linux Version
HDP 2.6.2 | HDP 2.6.2 | RHEL 7.3
HDP 2.6.4 | HDP 2.6.4 | RHEL 7.4
HDP/HDF Security Governance Demo | HDP 2.6.3, HDF 3.0.3 | RHEL 7.4
HDP/HDF Credit Card Fraud Detection Demo | HDP 2.6.3, HDF 3.0.3 | RHEL 7.4
HDP/HDF IOT Trucking Demo | HDP 2.6.3, HDF 3.0.3 | RHEL 7.4

Size | Flavor Options Description
Small | 8 vCPUs, 24GB memory, 50GB disk
Medium | 16 vCPUs, 32GB memory, 200GB disk
Large | 24 vCPUs, 48GB memory, 500GB disk
How to Get Started with Hortonworks on OpenPOWER Systems
• Learn more about the benefits of IBM Power Systems and OpenPOWER
• Join the Hortonworks Community: https://community.hortonworks.com/
• Learn more about the benefits of Hortonworks: http://hortonworks.com/training/
• Sign up for Free Data Science and Cognitive Computing courses:
https://cognitiveclass.ai/
• Try the solution: IBM benchmark centers, on the cloud or on your premise
Q&A
IBM Cloud / DOC ID / Month XX, 2017 / © 2017 IBM Corporation
Thank you
