Paving the Way to
"Data Driven”
Mohd izhar firdaus ismail
solution architect
abyres enterprise technologies sdn bhd
www.abyres.net
(c) 2017 Abyres Enterprise Technologies Sdn Bhd
About Me
● About Me
– Mohd Izhar Firdaus Bin Ismail
– Solution Architect & Head of Data Engineering Department, ABYRES Enterprise Technologies Sdn Bhd
● About ABYRES
– System Integrator company focusing on consulting and implementation of state-of-the-art solutions around Open Source IT infrastructure and data center modernization
● Data Engineering & Big Data
● IT Modernization
● Enterprise Mobility Platform
● Unix to Linux Migration
Outline
● Demystifying Big Data
● History of Big Data
● Impact of Big Data
● Evolution of Data Management
● Big Data Architectures
● Data Collection
● Internet of Things
● Tools & Technologies
● Open Source License
● Framework for Data Journey
● Hortonworks Case Studies
Demystifying Big Data
Back to Basics
Input Process Output
Storage
Traditional Computing: "Small" Data
Input Process Output
Storage
Low to medium rates of incoming data can easily be collected using software/applications that run single-core or multi-threaded.
Processing low to medium amounts of data can easily be done using simple architectures, processing data in single/multi-core environments. Managing storage of the data is also simple, using a single disk or an array of disks merged together in RAID, in a single machine.
Outputs show simple results that can easily be viewed using client software that can even load the whole dataset while still giving good performance.
Big Data Computing: Massive Data!
Input Process Output
Storage
High volume and velocity of incoming data call for a totally different breed of data collection and pipelining software that can run in a distributed environment, across thousands of cores on hundreds of computers.
High volumes of data with complex processing needs, especially complex relationships and complex unstructured data, require high-throughput distributed computing to process the data and get results in time for the business to use. RAID arrays are no longer enough to store the high volume of data, calling for distributed storage that can easily scale to cater for additional data arriving at high velocity.
Analyzing and visualizing massive amounts of data to make sense of their complex relationships cannot easily be done through basic charts, calling for new visualization techniques, and strategies to minimize client-side processing of visualizations in order to get good rendering performance.
Criteria of Big Data
● Volume
– Data coming from various sources and increased regulation in multiple areas mean storing more data for longer periods of time
– Gigabytes, Terabytes, Petabytes ... Zettabytes
● Velocity
– Machine data, as well as data coming from new sources, is being ingested at speeds not even imagined a few years ago
– 1MB/s, 10MB/s, 50MB/s growth rates and beyond
● Variety
– Unstructured and semi-structured data is becoming as strategic as structured data
– Video, audio, images, free text
Uncaptured & UnanalyzedData – AMissed Opportunity
All organizations have data lying around, either not yet
captured, poorly captured, or captured but not analyzed.
The data may contain hidden gems for improving decision
making, leaving them alone is a missed opportunity
Big Data Processing Technologies are NOT Big Data
● It is a common misconception that if one is adopting Hadoop, Spark, etc., one is adopting Big Data. This is not true.
● Big Data is the massive amount of data you have either collected, or have the opportunity to collect, but are unable to collect and process due to computing or cost limitations.
● Adopting Hadoop, Spark, or any other Big Data technology without a strong data collection and analysis strategy will not give you the benefits that you might want.
Hadoop != Big Data
How Big Is Big?
● A common question: how big should my data be for it to be considered Big Data?
● The answer is fairly subjective, depending on the organization, data, analytical processes, and outputs you are dealing with. But ask yourself these two questions to gauge whether you have a Big Data problem:
– Is your current data architecture/infrastructure able to collect the data and produce the output you require, in a timely manner?
– Do you plan to collect more and more data, terabytes and beyond, with the goal of analyzing it very rapidly, and do you want to archive the raw data for a long period of time?
● If the answers are Yes and No, you are likely not dealing with a Big Data problem. If they are No and Yes, you are likely dealing with a Big Data problem – or possibly just an optimization problem.
Data – The New Oil
Basic Concepts of Petroleum Mining
● Petroleum reserves exist in wells and shale
● Mining equipment extracts petroleum from wells
● Petroleum pipelines transport petroleum to silos and refineries
● Silos store petroleum before it is processed
● Refineries refine petroleum to create petroleum-based products for consumers
● Petroleum engineers design, construct, and maintain petroleum mining, pipeline, silo and refinery infrastructure
● Petroleum scientists research petroleum to create new products and applications using components found in petroleum
DATAMining & Analytics
Data exists in
environment
Sensors and
data collection
software extract
data from
environment
Ingestion / ETL data
pipelines bring raw
data to central
data repository
Data refineries / processing software
processes data to extract analytical
results for use by data consumers
Data repositories / databases
stores data for analytics
purposes
Data engineers design, construct, and maintain
data mining, pipeline, repositories and processing
infrastructure
Data scientists research
on data to create new
products and applications using
analytical results from data
We are your data
engineers!
Handling Big Data: Data Science vs Data Engineering
● Data Science / Data Analysis
– Extract value from data
– Descriptive/Predictive/Prescriptive Analytics
– Unstructured data analysis
– Domain expertise
– Skills: Statistics, R, Python, Spark ML, Weka, Scala, etc.
● Data Engineering
– Infrastructure, technologies and expertise to handle Volume, Velocity, Variety of data
– Data pipelining, ingestion, scheduling and pre-preparation
– Job/query optimization, parallel processing, data processing automation
– Dashboards & data applications
– Hadoop, YARN, NiFi, NoSQL, Python, MapReduce, Java, etc.
Profile of a Data Scientist

Math & Statistics
● Machine learning
● Statistical modeling
● Experimental design
● Bayesian inference
● Supervised learning: decision trees, random forest, logistic regression
● Unsupervised learning: clustering, dimensionality reduction
● Optimization: gradient descent and variants

Domain Knowledge & Soft Skills
● Passionate about the business
● Curious about data
● Influence without authority
● Hacker mindset
● Problem solver
● Strategic, proactive, creative, innovative and collaborative

Programming & Database
● Computer science fundamentals
● Scripting language, e.g. Python
● Statistical computing package, e.g. R
● Databases: SQL and NoSQL
● Relational algebra
● Parallel databases and parallel query processing
● MapReduce concepts
● Hadoop and Hive/Pig
● Custom reducers
● Experience with XaaS like AWS

Communication & Visualization
● Able to engage with senior management
● Storytelling skills
● Visual art & design
● R packages like ggplot or lattice
● Knowledge of visualization tools, e.g. Flare, D3.js, Tableau
Profile of a Data Engineer

Math & Statistics
● Machine learning
● Statistical modeling
● Experimental design
● Bayesian inference
● Supervised learning: decision trees, random forest, logistic regression
● Unsupervised learning: clustering, dimensionality reduction
● Optimization: gradient descent and variants

Domain Knowledge & Soft Skills
● Passionate about the business
● Curious about data
● Influence without authority
● Hacker mindset
● Problem solver
● Strategic, proactive, creative, innovative and collaborative

Programming & Database
● Computer science fundamentals
● Scripting language, e.g. Python
● Statistical computing package, e.g. R
● Databases: SQL and NoSQL
● Relational algebra
● Parallel databases and parallel query processing
● MapReduce concepts
● Hadoop and Hive/Pig
● Custom reducers
● Experience with XaaS like AWS

Communication & Visualization
● Able to engage with senior management
● Storytelling skills
● Visual art & design
● R packages like ggplot or lattice
● Knowledge of visualization tools, e.g. Flare, D3.js, Tableau
History of Big Data
Computing, Before the "Big Data" Era
Input Process Output
Storage
Applications selectively collect only the data necessary for their core functionality, discarding the rest.
Due to technological limitations, applications mostly store recent data and process it to generate relatively simple reports. Old data is regularly purged for performance and cost reasons. Complex, intense processing dealing with massive amounts of data requires big, expensive mainframes or supercomputers.
Reports and analytical outputs are limited to low-frequency processing (e.g. daily, monthly) due to computing limitations.
Google – Pioneer of Big Data
[Diagram: all public websites on the internet → Process & Index → Search service]
* generalization / high level, not actual architecture
Google spiders / Googlebots crawl the internet, capturing every web page they can reach, and bring the data into Google's internal data storage and processing infrastructure.
Google's backend processing engines regularly process and update the website index, rank websites using the proprietary Google PageRank algorithm, and then provide a fast, searchable index of the whole internet to end users.
Google Solution (Pre-2003): GFS + MapReduce
[Diagram: GoogleBots → Google File System → Map/Reduce → Google Search Engine → Search]
Web page data is collected and stored in a distributed datastore, across lots of commodity hardware.
The MapReduce framework analyzes, transforms, and ranks web pages en masse, periodically, before they are sent for indexing in the Google search engine cluster.
Nutch Project (2002) – An Attempt to Create an Open Source Web Search Engine Infrastructure
[Diagram: Nutch Crawler feeding storage and processing components that did not exist yet]
The Nutch project was attempting to build a full-scale web search engine, from crawler to indexing. Back then, however, it only had a web crawler, and had yet to solve the storage and processing problem for the data it gathered.
Google Released the GFS / MapReduce Papers – 2003-2004
● Google released the GFS (late 2003) and MapReduce (late 2004) papers to the community, describing the architecture used to store and manage distributed data at Google, and how that data is processed in a distributed manner.
Nutch Distributed Filesystem + Nutch MapReduce (2004-2005)
[Diagram: Nutch Crawler → NDFS → MapReduce]
Having the goal of creating a search engine, the Nutch project picked up both the GFS and MapReduce papers and developed its own implementations of both technologies, as the Nutch Distributed File System (NDFS) and Nutch MapReduce.
Hadoop Project Branched Out from the Nutch Project
In 2006, the Hadoop project split out from the Nutch project to provide a specialized, affordable solution for storing and processing massive amounts of data using commodity hardware. The open source nature of Hadoop helped spark the move towards Big Data processing across the whole industry by providing an affordable solution for massive data processing.
Now everybody can compute massive amounts of data!!
HDFS MapReduce
The MapReduce Paper Also Inspired Other Technologies Following Its Architecture
Some existing database technologies, such as MongoDB and some PostgreSQL flavors, also adopt MapReduce internally for computing over distributed data in their clusters.
Various programming languages also have libraries that implement MapReduce as a distributed computing algorithm, not necessarily on Hadoop.
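To make the pattern concrete, here is a minimal word-count sketch in plain Python (a conceptual illustration only, not the Hadoop API): a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. In a real framework, the three phases run distributed across many machines.

    from collections import defaultdict

    def map_phase(document):
        # Emit a (word, 1) pair for every word in one input record
        for word in document.split():
            yield (word.lower(), 1)

    def shuffle(pairs):
        # Group emitted values by key (the framework does this in Hadoop)
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Aggregate all values belonging to one key
        return key, sum(values)

    documents = ["big data is big", "data is the new oil"]
    pairs = (pair for doc in documents for pair in map_phase(doc))
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'the': 1, 'new': 1, 'oil': 1}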
Impact of Big Data Adoption on Data Analytics Practice
Stages of Organizational Data Growth
* source: Teradata
Six Sigma – Data-Driven Decision Making
Supported by data from the data analytics practice
Business Intelligence vs Business Analytics

What does it do?
– Business Intelligence: reports on what happened in the past, or what is happening right now.
– Business Analytics: investigates why it happened, and predicts what may happen in the future.

How is it achieved?
– Business Intelligence: basic querying and reporting; OLAP cubes, slice and dice, drill-down; interactive display options such as dashboards, scorecards, charts, graphs and alerts.
– Business Analytics: applying statistical and mathematical techniques; identifying relationships between key data variables; revealing hidden patterns in data.

What does your business gain?
– Business Intelligence: dashboards with "how are we doing" information; standard reports and preset KPIs; alert mechanisms when something goes wrong.
– Business Analytics: a response to "what do we do next?"; proactive and planned solutions for unknown circumstances; the ability to adapt and respond to changes and challenges.
Components of Data Analytics
● Descriptive – What happened?
● Diagnostic – Why did it happen?
● Predictive – What will happen?
● Prescriptive – What to do next?
(Supporting technologies across the spectrum: OLAP, Statistics, Data Mining, Machine Learning, Artificial Intelligence, Deep Learning, Knowledge Base)
Data requirement increases as you move from descriptive towards prescriptive.
Cost of Data Analytics
(From descriptive through diagnostic and predictive to prescriptive: data requirement, data scale, and computing power needs all increase.)
● As we go up the chain, from descriptive to prescriptive, we require more data to analyze in order to compute the outputs
● Historically, only those who could afford supercomputers and large mainframes could apply advanced predictive & prescriptive analytics in their business by analyzing their data assets
– For those who could not afford such advanced technology, computation took so long that it became impractical to apply in business
● With Big Data adoption, several barriers were removed:
– It became easy for programmers to write computation algorithms across hundreds of commodity machines
– Existing algorithms that used to run only on a single computer were ported over for distributed computing
– Cloud-based architectures allow usage-based costing with minimal to no upfront cost
– Big Data on open source technologies removes upfront software cost for the technically savvy
– Advanced analytics became affordable for businesses
Simpler Data Collection = More Data Collection = Bigger Data
[Diagrams: ETL – Raw Data → ETL Job → Transformed Data; ELT – Raw Data → Ingestion Job → Raw Data Replica]
● Traditional Flow (ETL)
– An ETL flow needs to be developed for extracting and transforming raw data before loading it into the central data management platform
– The inherent cost of designing and developing an ETL flow and data model prevents data from being collected early
– Enhancing the data model with new sources involves changes to the ETL job, which can become unmaintainable in the long run
● Data Flow in Big Data Practice (ELT)
– Instead of waiting to develop an ETL flow and destination data model, raw data is brought immediately into the central data management platform through simpler ingestion jobs – the data collection barrier is removed
– Analytics can be done on the raw data, or a transformation job can be executed post-ingestion to prepare the data model
– However, ELT comes at the cost of requiring more data storage; but hardware is usually cheaper than manpower
ELT vs ETL

ELT – Advantages:
● No need for a separate transformation engine; the work is done by the target system itself
● Data transformation and loading happen in parallel, so less time and resources are spent (as only filtered, clean data is loaded into the target system)
● ELT works with high-end data engines such as Hadoop clusters, cloud or data appliances, giving it additional performance and security
● The processing capability of the data warehousing infrastructure reduces the time data spends in transit and makes the system more cost-effective

ELT – Disadvantages:
● The specifics of ELT development vary by platform; e.g. Hadoop clusters work by breaking a problem into smaller chunks, then distributing those chunks across a large number of machines for processing – some problems can be easily split, others will be much harder
● Developers need to be aware of the nature of the system they're using to perform transformations; while some systems can handle nearly any transformation, others do not have enough resources, requiring careful planning and design

ETL – Advantages:
● Single-view interface to integrate heterogeneous data
● Ability to join data both at the source and at the integration server, with the option to apply any business rule from within a single interface
● Common data infrastructure for working on data movement and data quality
● Parallel processing engine providing exceptional performance and scalability

ETL – Disadvantages:
● Migration from server to enterprise edition might require vast time and resources due to the innumerable architectural differences between the Server and Enterprise editions
● No automated error handling or recovery mechanism
● Expensive as a solution for small or midsized companies
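The contrast between the two flows can be sketched in a few lines of Python (a toy illustration; the function names and sample data are made up for this example): ETL transforms before loading and discards the raw data, while ELT lands the raw data first and transforms inside the target platform when needed.

    def etl(extract, transform, load):
        # ETL: transform outside the target, load only the modeled result
        raw = extract()
        load(transform(raw))            # raw data is not kept

    def elt(extract, load_raw, transform_in_target):
        # ELT: land raw data immediately, transform later inside the platform
        raw = extract()
        load_raw(raw)                   # raw replica kept for future analytics
        transform_in_target()           # e.g. a SQL-on-Hadoop job, run on demand

    # Toy demo: "extract" from a list, "load" into other lists
    source = [{"id": 1, "name": " Alice "}, {"id": 2, "name": "Bob"}]
    warehouse, lake = [], []
    clean = lambda rows: [{**r, "name": r["name"].strip().lower()} for r in rows]

    etl(lambda: list(source), clean, warehouse.extend)
    elt(lambda: list(source), lake.extend,
        lambda: warehouse.extend(clean(lake)))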
Evolution of Data Management Architectures
File Based
[Diagram: Input → Store → Read → Output]
● Most basic data management architecture
● Applications read/write data from files on disk
● Reports are generated by reading data from the files
Database
[Diagram: application Input → Inserts → database → Query → Output; the database stores to and reads from disk]
● Most common architecture for applications
● Separate application and database services/nodes
● The database takes care of abstracting the complexity and optimizing the performance of managing file-based storage
● The application deals with inserting the data gathered, and querying data to create outputs
Separated OLTP / OLAP Databases
[Diagram: the application inserts into an operational (OLTP) database; an ETL/sync process copies data into a replica (OLAP) database, which serves report queries]
● A natural path for reducing workload on the database: separate the infrastructure for operational application use from analytical reporting use
● The replica database syncs with the source database, and analytical processing queries are executed on the replica rather than the source
Data Warehouse
[Diagram: multiple application databases feed a central data warehouse through ETL; the warehouse serves report queries]
● When analytical reports are to be generated using data coming from many data sources, a central data warehouse provides the necessary infrastructure for cross-system analytical queries
● Data is moved into the data warehouse through an extract-transform-load process, which normalizes datasets and makes cross-system data joining possible
● Data marts are usually created, containing more human-understandable and domain-specific data structures, making it easy for non-technical users to analyze data in the warehouse
Modern Data Architecture (Data Warehouse + Data Lake)
[Diagram: operational databases feed the Data Warehouse via ETL as before; applications and other sources also write/ingest raw data into a Data Lake, where ELT jobs prepare data for advanced analytics]
● Organizations with more advanced analytical practices want to collect not just data coming from operational databases, but also other datasets, from various sources and formats, that may be generated by applications
● A Data Lake provides a simpler architecture for gathering these datasets for future analytical use, and a highly scalable platform for computing over massive data
● A Data Lake is usually used together with existing Data Warehouses, to leverage their strength in structured data processing
Data Architectures For Big Data
Lambda Architecture
[Diagram: an ingested data stream is written to two layers. Batch layer: all data is stored and periodically batch-precomputed into aggregated views (an ELT path also writes ingested data here). Speed layer: a message queue feeds preprocessing and real-time aggregation into real-time aggregated views. Serving layer: queries merge the batch and real-time views to produce output.]
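A hedged sketch of the serving-layer idea in Python (the view names are illustrative): a query merges a precomputed batch view, which is complete but stale, with a small real-time view covering only the records that arrived since the last batch run.

    def query(key, batch_views, realtime_views):
        # Batch layer: complete but stale, recomputed periodically over all data
        # Speed layer: incremental, covers only records since the last batch run
        # Serving layer: merge both views to answer with complete, fresh results
        return batch_views.get(key, 0) + realtime_views.get(key, 0)

    batch_views = {"page_hits:/home": 10_000}   # from the nightly precompute
    realtime_views = {"page_hits:/home": 42}    # from today's stream so far
    print(query("page_hits:/home", batch_views, realtime_views))  # 10042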
Batch Processing
[Diagram: Ingest → ELT → Write → Query → Output]
Characteristics:
● Scheduled or interactive processing
● Bulk activity
● Historical data, or a subset of historical data
● Processing takes from seconds to hours
● Primarily analytical and reporting processing
● Results are used by automated systems or users
Strengths:
● Able to access and compute all data for analysis
● Relatively simpler to implement
● Familiar setup, as most systems are batch
Weaknesses:
● Not suitable for frequent queries if data is very large; requires data flow optimization to precompute
Real Time Processing
[Diagram: Ingest → Stream → Write → Query → Output]
Characteristics:
● Data is processed as it comes
● Deals primarily with the most recent data
● Processing a record takes milliseconds to tens of seconds
● Supports complex event processing and notifications
● Results are used by automated systems
Strengths:
● Lower load over time, because data is processed as it arrives throughout the day rather than in bulk operations
● Immediate updates to analyzed reports throughout the day, allowing faster decision making
Weaknesses:
● More difficult to develop, as it requires writing a real-time data pipeline application
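As a minimal illustration of the difference from batch (plain Python, no streaming framework), the sketch below consumes records one at a time and updates a running aggregate immediately, which is the essence of real-time processing.

    import random
    import time

    def sensor_stream(n):
        # Stand-in for a message-queue consumer yielding one record at a time
        for _ in range(n):
            yield {"ts": time.time(), "value": random.uniform(0, 100)}

    count, total = 0, 0.0
    for record in sensor_stream(1000):
        # Each record updates the aggregate the moment it arrives;
        # no end-of-day bulk job is needed to refresh the report
        count += 1
        total += record["value"]
        running_avg = total / count
    print(f"processed {count} records, running average = {running_avg:.2f}")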
Data Collection: An Important Component of Data Analytics
Big Data Titans Are Data Collection Titans
When it comes to data collection, these companies collect whatever they can from all points in their business operations.
Data Analytics Is Dependent on Input Data
[Diagram: N inputs → Process (with Storage) → N outputs]
Various Sources of Data Collection
Click Stream
Logs
Sensor
Web /
Social Media
RDBMS
Applications
Devices
Internet
Mobile
Databases
2 Strategies of Data Collection
● Business Question Driven
– Data collected based on business needs
– Clear scope, goals and deliverables
– Manageable size
– Long turnaround time before data can be turned into actionable insights: you have to wait for data growth, and advanced analytics is not possible until the data has grown large enough
● Collect First, Analyze Later
– Data collected as it is discovered or required
– Builds data assets before doing data analytics
– Requires initial investment for data storage
– Risk of collecting useless data
– Business questions are asked against the available data assets; rich data assets allow advanced analytics to become available with a shorter turnaround time
Internet Of Things
Data Is Everywhere
● The Internet of Things is about a connected world, where everything is connected to the internet
– Everything is an input data source
– Everything is an output display
● Sensors everywhere
– GPS
– Temperature
– Humidity
– Luminosity
– Audio
– Video
– etc., etc., etc.
● IoT brings massive amounts of data – Big Data
Typical IoT Application Architecture
IoT sensors collect data and send it to the application backend in the cloud.
An army of servers works together in the cloud to store and process the data for the IoT application.
Analytical results from analysis of the collected data are provided to customers and users, delivering value.
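A minimal sketch of the sensor side of such an architecture, using only the Python standard library; the ingestion URL and payload fields are hypothetical stand-ins for a real backend.

    import json
    import random
    import time
    import urllib.request

    INGEST_URL = "https://iot-backend.example.com/ingest"  # hypothetical endpoint

    def read_sensor():
        # Stand-in for a real GPS / temperature / humidity sensor read
        return {"device_id": "sensor-01", "ts": time.time(),
                "temperature_c": round(random.uniform(20.0, 35.0), 2)}

    for _ in range(3):  # a real device would loop forever
        payload = json.dumps(read_sensor()).encode("utf-8")
        request = urllib.request.Request(
            INGEST_URL, data=payload,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(request)  # the cloud backend stores & processes it
        time.sleep(60)                   # report once a minute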
Tools and Technologies for Big Data
Ecosystem
● Data Collection
● Data Pipelining
● Data Processing
● Data Storage
● Data Serving
● Data Visualization
Data Collection
● The starting point of accumulating data assets
● Measure or capture environment variables or state as digitized data
● Tools/equipment include, but are not limited to:
– Any programming language
● Write out application states as logs (see the sketch after this list)
– Web scrapers
● Scrapy / Portia
● FMiner
● Outwit
● Mozenda
● Capterra
– Sensor equipment
● Raspberry Pi
● Arduino
● Various sensor circuits
● SCADA
– RDBMS extractors
● Sqoop
● Various ETL tools
● Custom scripts
– Mobile devices
● Modern smartphones have a rich array of sensors
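For "write out application states as logs" above, a minimal Python sketch: emit one self-describing JSON object per line so downstream ingestion jobs can parse events without custom code. The event and field names are illustrative.

    import json
    import logging
    import time

    logging.basicConfig(filename="app_events.log",
                        level=logging.INFO, format="%(message)s")

    def log_event(event_type, **fields):
        # One JSON object per line: trivially parseable by ingestion pipelines
        record = {"ts": time.time(), "event": event_type, **fields}
        logging.info(json.dumps(record))

    log_event("page_view", user_id=123, path="/checkout")
    log_event("purchase", user_id=123, amount_myr=49.90)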
Data Pipelining
● Move data from sources to repositories
● Coordinate and schedule data extraction and pre-processing workflows while data is in flight to the repositories (see the sketch after this list)
● Tools include, but are not limited to:
– Workflow programming libraries in various languages
● Airflow
● Luigi
● Oozie
● etc.
– Traditional ETL tools
● Talend
● Pentaho
● Oracle Data Integration
● etc.
– Stream data pipeline tools
● Apache NiFi
● NodeRED
● StreamSets
● Storm
● Kafka Connect
● etc.
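To illustrate what these coordinators do at their core, here is a toy dependency-ordered task runner in plain Python (a conceptual sketch only; real tools such as Airflow or Oozie add scheduling, retries, monitoring and distribution).

    tasks = {
        "extract":   (lambda: print("pull from source"),   []),
        "land_raw":  (lambda: print("write raw replica"),  ["extract"]),
        "transform": (lambda: print("prepare data model"), ["land_raw"]),
        "publish":   (lambda: print("refresh reports"),    ["transform"]),
    }

    def run(name, done=None):
        # Depth-first: run every upstream dependency first, each task once
        done = set() if done is None else done
        if name in done:
            return
        func, deps = tasks[name]
        for dep in deps:
            run(dep, done)
        func()
        done.add(name)

    run("publish")  # executes extract -> land_raw -> transform -> publish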
Data Storage
● Store and archive data for short- and long-term use
● Works together with the processing infrastructure to extract insights from data by providing optimized data structures
● Tools include, but are not limited to:
– Software-defined distributed storage
● HDFS
● GlusterFS
● Ceph
● ZFS
● etc.
– Databases
● PostgreSQL
● Oracle
● MSSQL
● etc.
– NoSQL datastores
● MongoDB
● Elasticsearch
● Solr
● Neo4j
● HBase
● Redis
● etc.
– Message queues
● Kafka
● RabbitMQ
● Redis
● etc.
Data Processing
● Process and compute data to extract value and insights
● Process data either in batch or in real time, ideally in a distributed manner
● Provides algorithms for complex computations
● Tools include, but are not limited to:
– Any programming language, especially R, Scala, Python
– Distributed batch processing engines
● MapReduce
● Tez
● Hive
● Pig
● Spark
– Distributed stream processing engines
● Storm
● Celery
● StreamParse
– Traditional ETL tools
● Talend
● Pentaho
● Oracle Data Integration
Data Serving
● Serve processed data for high-performance analytical queries
– Utilizes highly optimized data structures for purpose-specific queries
● Tools include, but are not limited to:
– High-performance OLAP
● Druid
● Kylin
– Graph data stores
● Neo4j
● ArangoDB
– Search engines
● Elasticsearch
● Solr
– Time series databases
● Graphite
● InfluxDB
● OpenTSDB
● Prometheus
Data Visualization
● Display data summaries and reports in the form of visual diagrams and charts
● Visual data discovery and exploration
● Tools include:
– Traditional BI / reporting tools
● Pentaho
● Jasper
● SAS
● SpagoBI
● Microstrategy
● etc.
– Real-time dashboarding
● Grafana
● Kibana
– Visualization libraries
● D3.js
● DC.js
● Shiny
● Bokeh
● etc.
– Visualization platforms
● Tableau
● Redash
● Superset
Understanding Open Source License &
Consumption Model
Modern Big Data Technologies Are Driven by the Open Source Community
Open Source Software Definition
● Software licensed under a license that guarantees the following rights:
– Free Redistribution
● The license shall not restrict any party from selling or giving away the software as a component of an aggregate software
distribution containing programs from several different sources. The license shall not require a royalty or other fee for
such sale.
– Source Code
● The program must include source code, and must allow distribution in source code as well as compiled form. Where
some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source
code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The
source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated
source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
– Derived Works
● The license must allow modifications and derived works, and must allow them to be distributed under the same terms as
the license of the original software.
– Integrity of The Author's Source Code
● The license may restrict source-code from being distributed in modified form only if the license allows the distribution of
"patch files" with the source code for the purpose of modifying the program at build time. The license must explicitly
permit distribution of software built from modified source code. The license may require derived works to carry a different
name or version number from the original software.
Open Source Software Definition
– No Discrimination Against Persons or Groups
● The license must not discriminate against any person or group of persons.
– No Discrimination Against Fields of Endeavor
● The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may
not restrict the program from being used in a business, or from being used for genetic research.
– Distribution of License
● The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of
an additional license by those parties.
– License Must Not Be Specific to a Product
● The rights attached to the program must not depend on the program's being part of a particular software distribution. If the
program is extracted from that distribution and used or distributed within the terms of the program's license, all parties to
whom the program is redistributed should have the same rights as those that are granted in conjunction with the original
software distribution.
– License Must Not Restrict Other Software
● The license must not place restrictions on other software that is distributed along with the licensed software. For example,
the license must not insist that all other programs distributed on the same medium must be open-source software.
– License Must Be Technology-Neutral
● No provision of the license may be predicated on any individual technology or style of interface.
Open Source Does Not Mean No Copyright
● Open Source software is copyrighted, not public domain
– The author retains the copyright and intellectual property; however, the author chooses to grant licensees of the software additional rights which normally are not granted under a proprietary license
– Any user of the software automatically becomes a licensee the moment they acquire a copy of the software
– Open Source authors usually re-use legal license documents that already exist in the Open Source community as the license for their software
● Should you not comply with the terms and conditions in the license document, the author has the right to enforce the license
Types of Open Source Licenses

Permissive:
● Most flexible
● Derivative works are not required to be Open Source or to use the same license
● e.g. MIT, BSD

Weak Copyleft:
● Some parts of derivative works are required to use the same license
● Usually this license is used on a library
● Modifications to the library itself are required to be released under the same license, but projects importing the library are not required to use the same license
● e.g. LGPL

Strong Copyleft:
● Strict enforcement of the same license for any derivative works
● All projects importing libraries provided by software licensed under this license are required to also be released under the same license as the original work
● e.g. GPL, AGPL
Consumption Model
● Common misconceptions about Open Source in the enterprise:
– Open source is free
– Open source is not licensed
– Open source comes without support
– Open source software is not stable and changes too fast
● Open Source is a software development model, but not exactly a software consumption model. The consumption model is more or less similar to that of other software
● Software can be free, but human time is not
1. Software is developed in public, possibly with community contributions, and with many frequent improvements
2. An Open Source software distribution company takes a snapshot of the codebase; stabilizes and integrates it; creates support, training and warranty models; and productizes the software
3. An enterprise customer buys the productized software, and receives support, training and warranty from the distribution company, and services from SIs
4. An ecosystem of System Integrators, ISVs and trainers provides professional services and added value around the productized software
Upstream Software vs Downstream Enterprise Product

Upstream:
● Rapidly changing
● Latest and greatest features
● Can be unstable
● No warranty, no support, or minimally supported
● Most of the time free of capital cost

Enterprise:
● Fewer changes over a short period of time
● Tried and tested features
● Generally more stable
● Comes with warranty, support SLA, training and certifications
● Charged for support & warranty subscriptions and professional services
Do I Have to Use an Enterprise Product?
Do I have to use the Enterprise edition of an Open Source software?
– Are you going to use it as a hobby or professionally? Hobby → not necessary.
– If professionally: are you using it for R&D or for production? R&D → not necessary.
– If production: do you have regulations against using software without warranty or internal expertise in production? Yes → required.
– If no such regulations: do you have the budget? No → not necessary; Yes → recommended.
A Framework For Organizational
Data Journey
Big Data Transformation Journey
Stage 1: Data Discovery on Active Archive
● Initial starting infrastructure for Proof of Value
– 2 masters, 3 workers for batch/interactive processing
– 1 node for stream processing
● Select several datasets and ingest both current data and the historical archive, which are then made available for analyzing patterns over a long historical context. Also ingest related tables
– e.g. Touch N' Go transactions
● Familiarize with the technologies and processes involved in Big Data
● Create reports/dashboards detailing discoveries from analysis of the historical archive
● Time frame: 6-12 months
Stage 2: Data Lake
● Medium-scale cluster for a central data lake
– 3 masters, 10-15 workers for batch/interactive processing
– 2 nodes for stream processing
● On-board more datasets from various internal sources into the data lake to get a 360° view of the organization
– e.g. CRM, ERP, website logs, device logs, etc.
● Develop and launch reports and dashboards on cross-dataset relationships and patterns
– e.g. 360° view of the customer
● Time frame: 6-24 months
Stage 3: Advanced Analytics
● Large-scale cluster for complex computation
– 4 masters, 20++ workers for batch/interactive processing
– 4++ nodes for stream processing
● On-board external data sources for enrichment against internal datasets
– e.g. social media, web scrapers, IoT sensors
● Aggressive data collection and data mining as a strategic direction and asset
● Identify repetitive patterns, create models to predict them, and leverage them in AI-powered applications
● Time frame: 12-24 months
Stage 4: Continuous Improvement
● Continuous data-driven transformation and innovation
Hortonworks Case Studies
The Data Journey
to a Golden Batch
Case Study: Merck's Journey
Improving Life Sciences Manufacturing Yields Presents a Complex Data Discovery Challenge
● Vaccine manufacturing requires precise control of complex fermentation processes
● Two batches of a vaccine, produced using an identical manufacturing process, can exhibit significant yield variances
● Batches that fail quality standards can cost $1 million each
● Data for one vaccine was stored across 16 different systems, and high storage costs limited the length of data retention
Merck's Journey: The Journey to the Golden Batch
[Diagram: renovate → innovate, through Sensor Data Storage, Scientific Search, Epidemiology and Vaccine Yield Optimization, towards The Golden Batch]
● Combined 10 years of data on one vaccine: 1 billion records
● 5.5 million batch comparisons
● 1st-year yield boost of 40K more doses → $10M profit impact
● McKinsey: 50% yield increase
The Data Journey
to Safe Roads
Case Study: Progressive's Journey
Progressive Wanted to Ingest IoT Data to Predict Risk for its Usage-Based Insurance Product
● Progressive Snapshot offers usage-based insurance through an in-car sensor that transmits IoT driving data
● Sensors collect up to six months of data from drivers, and the data is archived for years, per regulatory requirements
● Progressive's existing systems were not scaling efficiently
● It took 5-7 days to transform only 25% of available UBI data
Progressive's Journey: Rewarding Safer Drivers and Improving Traffic Safety
[Diagram: renovate → innovate, through Sensor Data Ingest, Web Log Analysis, Online Ad Placement, Individual Driving Histories, Claims Notes Mining and Usage-Based Insurance (UBI), towards Safe Roads]
● Snapshot plug-in devices capture driving detail
● Progressive stores more than 10 billion miles driven
● Through a web app, customers can review their own driving detail and improve their safety
● Snapshot and usage-based insurance drove $2.6 billion in 2014 Progressive premiums
The Data Journey
to Better Health
Case Study: Mercy's Journey
Mercy Medical System Sought a Data Lake for a Single View of its Patients – "One Patient, One Record"
● Existing platform impeded the goal of enriching Epic data for 1 million patients across 35 hospitals and 500 clinics
● Moving Epic EMR data to the Clarity EDW took 24 hours and was "never going to enable real-time analytics"; it now takes 3-5 minutes with HDP
● Improved billing processes resulted in $1M additional annual revenue from newly documented secondary diagnoses and care
Case Study: Mercy's Journey – Better Health Through Data
[Diagram: renovate → innovate, through Epic EMR Replication, Device Data Ingest, Lab Notes Archive, Privacy Database, Vital Sign Monitoring, Billing, OPEX Efficiency, Epic Enrichment, Single Patient Record, Medical Decision Support and Preventive Care, towards Better Health]
● Searches of free-text lab notes speed researcher insight from "never" to "seconds"
● Ingest of ICU vital signs increased by 900X, letting clinicians respond more quickly
● Mercy is building real-time tools to support surgical decisions and preventive care
Webtrends
The Data Journey Towards
Personalized Online Ads
Webtrends' Journey
Massive Volumes of Weblogs Fueled Webtrends' Growth – and also its Skyrocketing Storage Costs
● Webtrends provides digital marketing solutions for more than 2,000 companies in 60 countries – processing 13 billion online events daily
● Data used to be processed in relational databases and stored on large NAS appliances, which were not economical at scale
● Processing occurred on-premises, without cloud-based capabilities
● Diseconomies of scale hampered the company's objective of helping its customers predict optimal online ad placement
Webtrends' Journey: Petabytes of Weblogs Analyzed with Spark at Scale
[Diagram: renovate → innovate, through SQL Server Offload, Web Log Analysis, Per-Customer Click Path, Behavioral Segmentation, LCV Analysis and Ad Click Predictions, towards Personalized Online Ads]
"We're able to…look at this data set and process it and do predictions, behavioral analysis. We can do things that allow us to determine ROI for different actions and behavioral patterns." – Peter Crossley, Chief Architect
● Data streams from a vast array of desktop and mobile devices
● 13 billion daily events, each collected within milliseconds
● No data cleansing necessary prior to analysis with Apache Spark
● 2 clusters consolidated into 1 YARN-based HDP cluster
● Launched new product Webtrends Explore™ – powered by HDP
Watch The Webtrends Videos
https://youtu.be/hwpGj57VGz0
https://www.youtube.com/watch?v=LifVwIwN61E
The Data Journey
for Cyber Security
Case Study: Symantec's Journey
Analyzing Streaming Threat Data to Increase Velocity for Time to Protection
● The Symantec™ Global Intelligence Network includes more than 57 million attack sensors in 157 countries
● Data streams from 75 million users on 120 million devices
● Legacy platforms created 3-4 hour processing latencies to analyze log files for digital threats
● Attackers could exploit those processing time windows
Symantec's Journey: Data Science Speeds Time to Protection
[Diagram: renovate → innovate, through Greenplum Offload, Security Log Analysis, Device Data Ingest, Threat Archive, Metadata Capture, Threat Detection, Attacker Detection, Threat Predictions and Unified Security, towards Digital Security]
● Threat detection latency reduced from 4 hours to 2 seconds
● Time to protection improved 5000x
● Machine learning over tens of petabytes of historical data predicts threats to customers
● The cloud team uses Ambari and Cloudbreak for dynamic clusters to meet peak workloads
The Data Journey to
Secure Telco Networks
Neustar's Journey
Neustar's Telco Network Analytics Business was Limited by High Data Storage Costs
● Neustar offers its telecommunications customers Network Analytics services, but faced a 2011 cost of $100,000 per terabyte of storage
● It could only economically capture 10% of the data flowing through its networks, retained for 60 days
● Neustar's CEO challenged her data warehousing group to retain 100% of the network data for at least one year
Neustar's Journey: Architecture Renovation Funded Service Innovation
[Diagram: renovate → innovate, through Network Data Storage, Single View of the Network, Enriched App Data, DDoS Attack Mitigation, Rapid Threat Response, Proactive Network Protection and New Info Services, towards Secure Telecom Networks]
● Cost per terabyte reduced from $100K to under $250
● 100% of data now retained, growing storage capacity 150X
● Data retention extended from 60 days to 2 years
● Elimination of existing support fees saved millions annually
● New data assets help Neustar grow its product portfolio
The Data Journey to a
Balanced Supply Chain
Cardinal Health's Journey
Data Ingest Constrained Analysis of the Medical Supply Chain at Fuse by Cardinal Health
● Cardinal Health supplies equipment and medicines to 85% of US hospitals and clinics
● Limited visibility into the entire supply chain prevented suppliers from understanding how their drugs were prescribed
● Acute care pharmacists couldn't see all the product options that they could prescribe for various conditions
Cardinal Health's Journey: Launching a New Line of Business
[Diagram: renovate → innovate, through Sensor Data Ingest, Public Data Ingest, Prescription Archive, Single Patient Record, Drug Supply Chain Analytics, Drug Cost Optimization, Clinical Decision Support, Outcome-based Medicine and Pandemic Response, towards a Balanced Medical Supply Chain]
● Fuse by Cardinal Health aims to make healthcare safer and more cost-effective
● The team enriches supply chain data with public sources – bringing suppliers, providers and patients closer together
● Data processing speeds doubled
● Fuse shows suppliers how their drugs are used
Anonymous Case Studies
Online Ad Placement Analytics for a Mega-Retailer
Creating Opportunity – Data: Clickstream & Server Log
Advertising: manages online media programs for retail e-commerce websites
Problem: digital ad firm unable to connect impression & click data
● One of the world's largest retail websites made guesses about online ad placement based on Google Analytics
● Clickstream data flowed in at 100s of MB per hour and billions of rows per month; this data strained the existing architecture
● Inability to connect ad impressions to clicks to purchases
● No ability to detect browsing device, geo-location, or whether the customer was in the store
Solution: unified web tracking data repository provides a 360-degree view of online behavior
● Impression files and click files are stored in the same data lake, and easily joined for customer insight
● With better targeting, fewer ads can be placed, improving the overall customer web experience
● Social media data will be added for brand sentiment analysis
AD1
Monetize Anonymous & Aggregate Banking Data
Creating Opportunity – Data: Structured, Clickstream, Social & Unstructured
Banking: one of the largest US banks
Problem: valuable banking data needed to be anonymous & unified
● Bank possesses data that indicates larger macro-economic trends, which can be monetized in secondary markets
● Regulations and company policies protect customer privacy
● Data sets are isolated in legacy silos controlled by LOBs
● IT challenged by joining data while guaranteeing anonymity
Solution: cross-bank data lake for aggregate data with secure access
● Multiple data sets abstracted from source platforms
● Single point of security & privacy for de-identification, masking, encryption, authentication and access control
● Mortgage bankers, consumer bankers, the credit card group and treasury bankers have access to the same cross-sell data
● Interoperability with partners SAS, R, RedHat & Splunk
● Economies of scale for compressing & archiving data
● Significant reduction in storage costs from prior platforms
BK1
Sensor Data Monitors Buildings for Efficiency
Improving Efficiency – Data: Sensor
Building Management: building efficiency and power solutions; >$420B in revenue; >140 employees
Problem: managing service calls on HVAC in commercial buildings
● More than 70K systems in buildings around the US
● Systems transmit data, but it is mostly kept on site or discarded
● Servicing costs are high, due to limited data on each unit
● Data on work orders, sales orders and service orders is stored in different databases and not correlated
Solution: data consolidation and predictive analytics for efficiency
● Raw data from HVAC sensors will land in HDP, along with work order, sales order and service call data
● The system will predict component failures for:
– Product upsell → increased revenue
– Service call efficiency → reduced costs
● Management insight for a new service offering
BM1
Sensor Data From Smart Electricity Meters
Improving Efficiency – Data: Sensor
Energy: one of the world's largest producers of electricity; >$100B in revenue; >39 million customers; >150K employees
Problem: utility needs to match electricity supply with demand
● Utilities cannot store power; it needs to be used
● Some energy load is predictable, some is unpredictable
● Overproduction requires cutting back, running below capacity
● Underproduction risks starting less efficient "peaker plants"
● Smart meter data allows real-time analysis that can help effectively match energy production with consumption
Solution: predict demand spikes by analyzing real-time sensor data
● Hive + Storm on YARN stream data into Hadoop
● R + Mahout analyze aggregate consumption trends for predictive algorithms
● More effective matching of energy production and consumption reduces energy costs and emissions
EN1
Proactive Oil Field Decisions for Pump Equipment Utilization
Improving Efficiency – Data: Structured, Sensor & Server Log
Energy: major provider of upstream oil field services; >$29B in revenue; operations in 80 countries; >75K employees
Problem: limited visibility into utilization of pump equipment
● Oil field services: exploration, drilling, well construction & production optimization
● Company manages a huge base of costly equipment in the field, in 80 countries
● Time-consuming, manual effort required to collect & analyze pump equipment data
● Standard data warehouse model & traditional reports did not scale well & yielded incomplete results
Solution: combine structured data, sensor & log data for proactive equipment decisions
● Reduces the manual time and effort to collect & analyze data from sensors above and below ground, as well as log data from pump trucks
● Big Data project runs in the Accenture Cloud, with Accenture providing data architecture, data science and project management services
● Project integrated with embedded technologies from Hortonworks technology partners: Microsoft, SAP & HP
● Project goal: reduce equipment expense and improve margins
EN2
Powering Music Recommendations
Creating Opportunity – Data: Clickstream & Server Log
Entertainment: online music streaming; >$500M in revenue; >24M users
Problem: CDH cluster failed, causing downtime
● Highly technical team was running a CDH cluster, without support
● CDH failed; the CTO asked the team to research support options
● A Hive table stores data on all music streamed by users
● Data on Hive is mission-critical: used to recommend music & to pull the monthly reports used to pay each music label
● Data expertise is their only sustainable competitive advantage
Solution: HDP powers the music recommendation engine
● Stable recommendation engine and reconciliation reports
● Pro-active technology partnership with their engineers, who are consumers of & contributors to Hadoop
● 2X per year, Hortonworks reviews the cluster for optimization
● Data was migrated from CDH to HDP, quickly and easily
ET1
Donor and Voter Analytics for a Political Organization
Creating Opportunity – Data: Unstructured
Fundraising: political organization dedicated to tele-fundraising, voter contact and media services; >$1M in revenue; ~100 employees
Problem: limited insight into donor behavior & voter mobilization
● Fundraising phone services lacked analysis on why donors give
● For campaign management, needed analysis on what factors cause constituents to register and vote
● Client knew they needed Hadoop for storage and analysis
● Needed education on roadmap, use cases and execution
Solution: donor data store improves revenue from tele-fundraising
● Speed: rapid delivery of the donor data store
● Deployment flexibility: runs in a Windows environment
● Targeted: phone reps talk to donors about the issues important to them
● Discovery: explore and enrich data from campaign operations
FR1
Analysis of Gamer Data for Future Innovation
Creating Opportunity – Data: ETL
Gaming: online strategy & role-playing games; ~4M users; ~$325M in revenue; ~500 employees
Problem: social gaming platform needs more storage, more stability
● 4 million monthly gamers generate customer interaction data
● Existing CDH cluster was going down every month
● Desired tight integration with Datameer analytics tools
● Needed interactive query; Impala was not meeting that need
● Rapidly growing user base; need to manage the cluster as it scales
Solution: HDP for stability at scale, tight integration with Datameer
● Stable cluster that doesn't fall down like CDH did
● Easy data extracts from SQL Server
● Datameer analytics tools certified on HDP
● High-performing Hive queries
● Ambari for provisioning and maintenance as the cluster scales up
GM1
Gamer Migrates a Homegrown Cluster to HDP
Creating Opportunity – Data: Clickstream, Server Log, Social & ETL
Gaming: social gaming; ~5M users; >$100M in revenue; ~500 employees
Problem: social gaming platform used Hadoop, but needed support
● Social gaming platform built its own Hadoop cluster
● Heavy users of Hive for analysis of player behavior
● Hadoop analysis informed strategies to prolong length of play, encourage purchase of virtual goods, and respond to timed in-game events
● Heavy processing needs and ~1 petabyte of data outpaced the company's ability to support and extend its in-house cluster
Solution: HDP functionality + Hortonworks support = better games
● Easy migration from the native Hadoop cluster preserved data and processing tools
● HDP cluster includes a more complete ecosystem: Ambari, Flume, HBase, Hive, Oozie, Pig, Sqoop, Storm, ZooKeeper
● Social media sentiment analysis, combined with data on player stats and behavior, is used to improve games and their revenue
GM2
Clearing the Federal ETL Consulting Backlog
Improving Efficiency – Data: ETL
Government: professional service provider consulting on federal projects; >$13B in revenue; >50K employees
Problem: federal consulting practice faces ETL backlog
● Sequestration budget cuts created demand for ETL offload from SAS
● Consulting practice faces a backlog of millions of dollars consulting on offload from SAS at 20 federal civilian agencies
● After offload, all data must still be easily accessible
Solution: rationalized data storage saves taxpayer money
● Federal civilian agencies reduce ongoing data storage costs
● No loss of data or disruption to operations
● Base SAS and SAS/ACCESS are two out-of-the-box solutions for connectivity between SAS and Hadoop, via Hive
GV1
Page 111 / 156
Processing Time-Sensitive Employment Reporting w/ Confidence
Problem
Agency reporting on labor data has 9 working days to prepare report
• Agency reports on inflation, pay and benefits, unemployment levels, labor productivity
• Agency’s monthly employment report moves financial markets
• State agencies report unemployment data to federal office by first Friday of the month
• Total data set is hundreds of millions of rows in 30 comma-separated files
• If team finds errors in state data, it may take days to correct with the state affiliate
• Final report must be published by the third Friday of the month, time is precious
Solution
HDP speeds processing and improves confidence in unemployment findings
• Hortonworks partner OpenOsmium introduced Hortonworks to client team
• Federal budget pressures created favorable policies towards open source software
• POC pilot: processing one of thirty files on HDP/Amazon Cloud solution
• Processing time reduced from 18 hours to less than 1 hour
• Absolutely no disruption to existing systems or operations
• Cloud cluster runs on “as needed” basis, shut down remotely when not needed
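A rough sketch of how one of the thirty comma-separated files might be processed on the cluster with Hadoop Streaming, where mapper and reducer are plain Python scripts reading stdin. The column layout (state code in column 0, an unemployment count in column 3) is an assumed illustration, not the agency's actual schema.

#!/usr/bin/env python3
# Hadoop Streaming sketch: sum a count column per state from a CSV file.
# Run the same script as mapper and reducer, e.g.:
#   hadoop jar hadoop-streaming.jar \
#     -input /data/state_file_01.csv -output /out/unemp \
#     -mapper "python3 etl.py map" -reducer "python3 etl.py reduce" \
#     -file etl.py
# Column positions below are hypothetical assumptions.
import sys

def do_map():
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) > 3 and fields[3].isdigit():
            # key = state code, value = unemployment count
            print(f"{fields[0]}\t{fields[3]}")

def do_reduce():
    # Streaming guarantees keys arrive sorted, so we sum per run of keys.
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    do_map() if sys.argv[1:] == ["map"] else do_reduce()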
Improving Efficiency
Data: Structured
Government
US federal government labor
agency
GV2
Page 112 / 156
Sentiment Analysis for Government Programs
Problem
Min. of Ed. felt removed from public sentiment on programs
• In-person events lacked reach and persistence
• Ministry of Education wanted to understand sentiment from citizenry on specific issues
such as childhood obesity
• Two dedicated analysts pored over the social media stream and provided daily reports to a member of parliament
• IT team sought improvement over limitations of manual analysis
Solution
Powerful “same day” sentiment analysis helps outreach
• Team produces daily memos on public sentiment, now with:
– Reach: includes opinions from broader base of citizenry
– Confidence: more data, more confidence in opinion analysis
– Frequency: daily reads show policy-makers changes over time
– Precision: allows micro-analysis of specific issues and geos
• Solution aligns to government’s support for open source
• Individual social media authors receive invitations to in-person meetings with government
ministers
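At its simplest, the daily sentiment read described above can be approximated with lexicon scoring over the social stream. The sketch below is a stdlib-only illustration with toy word lists, not the ministry's actual model.

# Minimal lexicon-based sentiment sketch (stdlib only).
# The tiny word lists are toy assumptions, not a production model.
import re
from collections import Counter

POSITIVE = {"good", "great", "support", "helpful", "improved"}
NEGATIVE = {"bad", "worried", "unfair", "failing", "angry"}

def score(post: str) -> int:
    words = re.findall(r"[a-z']+", post.lower())
    counts = Counter(words)
    return sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)

posts = [
    "Great to see the school meal programme improved this year",
    "Worried that childhood obesity is still rising",
]
daily_total = sum(score(p) for p in posts)
print("net sentiment for the day:", daily_total)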
Creating Opportunity
Data: Social
Government
European national
government
GV3
Page 113 / 156
Sensor Data for Healthcare Supply Chain
Problem
Medical products have limited shelf life, tracking essential
• Medical products delivered to pharmacies and hospitals
• Epidemics require agile changes to delivery schedules
• Materials are time sensitive and climate-controlled
• Delivery logistics are complex & subject to risks outside of the company’s control
(product availability, weather, traffic, etc.)
• Slow delivery can harm supplies and medical outcomes
Solution
Sensor data protects supply chain, improves efficiency
• Sensor data from individual items and vehicles will give the company
unprecedented supply chain visibility
• Analytic platform enables predictive algorithms for infrastructure planning, disease forecasting and supply chain forecasting
• Better tracking reduces waste, improves customer confidence and patient health
Improving Efficiency
Data: Sensor
Healthcare
Supplier of
pharmaceuticals & medical
products to pharmacies &
hospitals
>$100B in revenue
>30K employees
HC1
Page 114 / 156
Predictive Analytics & Real-time Monitoring of Vital Signs
Problem
Unable to store sufficient data for decision support
• 22 years of data for 1.2 million patients, totaling ~9 million records
• Data on the legacy system was neither searchable nor retrievable
• Cohort selection for research projects was slow
• For decision support, clinicians had minimal access to historical data gathered
across all patients
Solution
Unified repo provides data to both researchers & clinicians
• “View only” legacy system retired, saving $500K
• 9 million historical records now searchable & retrievable
• Records stored with patient identification for clinical use, same data presented
anonymously to researchers for cohort selection
• Real-time monitoring: patches record vital signs every minute, algorithms notify
clinicians if numbers cross risk thresholds
• Readmit reduction: heart patients weigh themselves daily, algorithms notify doctors about unsafe weight changes
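The notification logic in the last two bullets reduces to threshold rules over streaming readings. A minimal sketch follows; the thresholds are hypothetical placeholders, not clinical guidance.

# Minimal threshold-alert sketch for streaming vital signs (stdlib only).
# Thresholds below are hypothetical placeholders, not clinical guidance.
THRESHOLDS = {"heart_rate": (40, 130), "spo2": (92, 100)}  # (low, high)
MAX_DAILY_WEIGHT_GAIN_KG = 1.5  # assumed heart-failure weight rule

def vital_alerts(reading):
    """reading: dict like {'patient': 'p1', 'heart_rate': 142, 'spo2': 95}"""
    for vital, (low, high) in THRESHOLDS.items():
        value = reading.get(vital)
        if value is not None and not (low <= value <= high):
            yield f"{reading['patient']}: {vital}={value} outside [{low}, {high}]"

def weight_alert(patient, yesterday_kg, today_kg):
    if today_kg - yesterday_kg > MAX_DAILY_WEIGHT_GAIN_KG:
        return f"{patient}: weight up {today_kg - yesterday_kg:.1f} kg in one day"
    return None

print(list(vital_alerts({"patient": "p1", "heart_rate": 142, "spo2": 95})))
print(weight_alert("p2", 80.0, 82.1))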
Improving Efficiency
Data: Sensor, Social
& ETL
Healthcare
Public university teaching
hospital
Consistently rated by US
News & World Report as
among America’s best
hospitals
>17K patient admissions
>400 physicians
~12K surgeries (‘12)
HC2
Page 115 / 156
Affordable, Scalable Data for Healthcare Analytics
Problem
Relational database architecture limited data exploration
• Develops and maintains analytic applications for doctors
• Company couldn’t access the volume or variety of data they wanted for those
applications
• Analyzing huge data sets on relational databases was too slow
Solution
HDP reveals new big data insights, with costs savings & flexibility at scale
• Link and access new types of data that are currently outside of the healthcare
domain such as: pharmacy receipts, text messages or patient web searches
• Per-node TCO of data on HDP was 25% that of current relational DB
• Open-source Hadoop ecosystem gives multiple hardware and software
integration options as company scales its architecture
Creating Opportunity
Data: ETL
Healthcare
Analytics tools and
decision support for the
healthcare industry
~$130M in revenue
>2K employees
HC3
Page 116 / 156
Rapid Detection & Intervention for Stroke Prevention
Problem
Conditions appearing to be strokes delay short windows for critical intervention
• Some conditions show stroke-like symptoms (e.g. migraines or muscle spasms)
• Stroke neurologists spend 50% of their time with non-stroke patients
• Transient ischemic attacks “TIAs” are mini-strokes that present like migraines, but are
highly predictive of future full-blown strokes within the following days
• Incomplete or slow access to patient data hampers clinicians’ ability to respond promptly
to TIAs
Solution
HDP unifies present day images with historical data to quickly identify TIAs
• Patient contact records (calls to the province’s health hotline) merged with population
historical records and present-day medical images improve diagnosis
• Algorithms on population risk factors (weight, age, cardiovascular problems) are mined
for probability that a given patient has similar risk factors
• With quantified risk factors, doctors quickly identify those at risk of imminent stroke
• Prescriptions of blood thinners, exercise and diet reduce incidence of those strokes
Improving Efficiency
Data: Sensor, Unstructured &
Structured
Healthcare
Top Canadian research
university, researching
epilepsy, stroke care and
brain surgery outcomes in
government-run healthcare
system
HC4
Page 117 / 156
Management of Chronic Health Conditions Such as Epilepsy
Problem
Epilepsy is a chronic, unpredictable & difficult to treat condition
• Epilepsy can go undiagnosed while seizures are minor
• Epileptics are at higher risk of depression, making condition more difficult to manage
• Tabular data is gathered through treatment at epilepsy specialty clinics
• Additional tabular data in the system is difficult to combine for a complete picture
• Social data on patient behavior is unavailable for combination with tabular data on
clinical history and pharmaceutical prescriptions
Solution
HDP healthcare data lake joins disparate data, for better disease management
• Data lake for a 360-degree view of the patient: electronic medical records, history of
clinic visits, Facebook, Twitter & sentiment survey data
• Regular, patient self-reporting with targeted surveys via mobile and web applications
• Dynamic calculation of changing sentiment scores useful for proactive outreach
• Clinicians will have ability to reference current psychographic & sentiment data
immediately before (and during) the patient’s scheduled clinical visits
Improving Efficiency
Data: Social & Structured
Healthcare
Top Canadian research
university, researching
epilepsy, stroke care and
brain surgery outcomes in
government-run healthcare
system
HC5
Page 118 / 156
Robotics & Real-Time Decision Support in Brain Surgery
Problem
Brain surgeons make real-time decisions using only a fraction of available
data
• Brain surgeons may spend hours working (non-destructively) through brain tissue
• For aneurysms, they must clamp a weak point in a specific blood vessel
• Surgical assistant presents ~100 clamps, which the surgeon tries until finding a good fit
• Clamps exposed to surgical environment are discarded at a cost of
$100K/surgery
• Time selecting/testing clamps can negatively affect surgical outcomes
Solution
Robots, streaming video & surgery inside an MRI with real-time decision
support
• Researchers developed non-magnetic robots that surgeons control within an MRI
• Constant streaming of MRI imaging helps decisions while surgery is underway
• Recordings of MRI data stored in Hadoop, analyzed w/ machine-learning
algorithms
• MRI images compared to surgical outcomes for insight into best practices
Improving Efficiency
Data: Sensor, Unstructured &
Structured
Healthcare
Top Canadian research
university, researching
epilepsy, stroke care and
brain surgery outcomes in
government-run healthcare
system
HC6
Page 119 / 156
Data Science on Text-based Health Claim Records
Problem
Claims data in PDFs, hard to identify coding errors
• Produces applications for medical decision support
• Goal is marrying electronic health records with claims data
• 300K daily interactions with individuals generate unstructured data in PDFs (claims records and patient-reported outcomes)
• Data analysis is disjointed, difficult to identify patients and events that have been
mis-coded or incompletely coded
Solution
Datasets unified in Hadoop to improve health outcomes
• Optical character recognition & natural language processing
• All of the unstructured, text-based data stored on HDP
• Coding errors will be identified much more efficiently
• Partially coded records can also be identified
• Coding efficiency will improve revenue
• Analysis of underlying data will improve health outcomes
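A minimal sketch of the OCR-plus-extraction step described above, assuming the pytesseract wrapper around Tesseract and a claim page rendered to an image. The diagnosis-code regex is a rough illustrative pattern, not the insurer's coding rules.

# OCR + code-extraction sketch (assumes: pip install pytesseract pillow,
# plus a local Tesseract install). File name and regex are illustrative.
import re
import pytesseract
from PIL import Image

# OCR one rendered claim page into plain text.
text = pytesseract.image_to_string(Image.open("claim_page_001.png"))

# Rough pattern for ICD-10-like diagnosis codes, e.g. "E11.9" --
# an illustrative approximation only.
codes = re.findall(r"\b[A-TV-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b", text)
print("candidate diagnosis codes:", sorted(set(codes)))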
Improving Efficiency
Data: Unstructured
Insurance – Health
Large US medical insurer
>$100B in revenue
>100K employees
IH1
Page 120 / 156
Insurance Data Lake to Manage Risk
Problem
Challenges merging new & old data hamper analysis
• Traditional and newer types of data were both growing quickly but were difficult
to combine in the EDW
• “Schema on load” requirements of EDW platform limited ingest of some data
with significant predictive power
• Company missed data-driven ways to serve customers
• Process of separating legitimate from fraudulent claims created “needle-in-a-
haystack” problem
Solution
Common platform for all types of data improves up-sell and reduces fraud
• “Schema on read” Hadoop architecture means that more data sources can be
easily ingested to enrich predictive analytics
• Agents use big data insights to determine the best action for valued customers
and recommend those in real-time
• Claims analysts and underwriters process streaming data to quickly flag fraud
risks and fast-track legitimate claims
Creating Opportunity
Data: Structured,
Clickstream, Server Log
Insurance – Health
Large US medical insurer
>$30B in revenue
>20M members
~35K employees
IH2
Page 121 / 156
Speeding Analysis for Usage-Based Car Insurance
Problem
Risk analysis lagged because of architecture gaps
• Business insight from data analysis was too slow
• Growing volume, velocity and variety of incoming data taxed existing systems &
processes
• ETL process across disparate systems only captured 25% of the dataset, took
5-7 days to complete
Solution
Speed time-to-insight w/ clickstream analytics & faster ETL
• Clickstream analytics
– Moving from a hosted Azure platform to HDP on site will improve performance and
analytical functions (with Apache Hive)
• ETL acceleration
– Process 100% of the data, in three days or less
Creating Opportunity
Data: Clickstream & ETL
Insurance –
Property &
Casualty
Personal auto & other
property-casualty
insurance
>$17B in revenue
~28K employees
IP1
Page 122 / 156
Data Lake for P&C Insurance Claim Analysis
Problem
Structured data scaled, unstructured data analysis did not
• Large P&C insurance provider had systems for analyzing structured data at scale
• Unstructured data from claims notes and social media data had the potential to
add valuable information to claims analysis
• Structured data analysis scaled, but joining this information with hand-written or
social media data did not scale
• Limited data visibility hampered underwriting and claims
Solution
Merge structured & unstructured data for better decisions
• “Schema on read” Hadoop architecture means that more data sources can be
easily ingested (text and social media)
• Previously disparate data sets are joined for greater insight
• Larger data sets fed to front-end business tools provided by Hortonworks
partners: SAS, Tableau and QlikView
Improving Efficiency
Data: Structured, Social
& Unstructured
Insurance –
Property &
Casualty
Major provider of property
casualty, life and mortgage
insurance
>$65B in revenue
>60K employees
Operations in >100
countries
IP2
Page 123 / 156
Maintaining SLAs for Equity Trading Information
Problem
Meeting 12 millisecond SLAs for “ticker plant”
• Daily ingest: 50GB server log data from 10,000 feeds
• Four times daily, this data is pushed into DB2
• Applications query this data 35K times per second
• 70% of queries are for data <1 year old, 30% for >1 year old
• Current architecture can only hold 10 years of trading data
• Growing volume puts performance at risk of missing SLAs
Solution
Meeting SLAs with confidence
• HBase provides super-fast queries within SLA targets (see sketch below)
• ETL offloading to Hadoop allows longer data retention, without jeopardizing fast
response times
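A minimal sketch of the low-latency HBase read path, assuming the HappyBase client over an HBase Thrift server and a hypothetical row-key design (symbol plus reversed timestamp, so the newest ticks sort first). Table, column family and key layout are placeholders, not the customer's schema.

# HBase read/write sketch (assumes: pip install happybase, and a running
# HBase Thrift server). Table, column and row-key layout are hypothetical.
import happybase

MAX_TS = 10**13  # reversed-timestamp trick: newest rows sort first

def tick_row_key(symbol: str, epoch_millis: int) -> bytes:
    # e.g. b"IBM#8300000000000" -- later ticks get smaller suffixes
    return f"{symbol}#{MAX_TS - epoch_millis:013d}".encode()

connection = happybase.Connection("hbase-thrift.example.com")
ticks = connection.table("ticks")

# Write one tick, then scan the most recent ticks for a symbol.
ticks.put(tick_row_key("IBM", 1700000000000), {b"q:price": b"188.02"})
for row_key, cells in ticks.scan(row_prefix=b"IBM#", limit=5):
    print(row_key, cells.get(b"q:price"))
connection.close()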
Improving Efficiency
Data: Server Log & ETL
Investment
Services
Highly trafficked website
providing business and
financial information
~15K employees
IS1
Page 124 / 156
Banking Data Lake for 100s of Use Cases
Problem
Architecture unsuited to capitalize on server log data
• Large investments company generates valuable data assets that are largely unavailable across the organization
• Current EDW solutions are appropriate for some data workloads but too expensive for
others
• Financial log data is difficult to aggregate & analyze at scale
• Short retention hampers price history & performance analysis
• Limited visibility into cost of acquiring customers
Solution
Multi-tenant Hadoop cluster to merge data across groups
• Server log data will be merged with structured data to uncover trends across assets,
traders and customers
• ETL offload will save money for Hadoop-appropriate workloads
• Longer data retention enables price history analysis
• Joining data sets for insight into customer acquisition costs
• Accumulo enforces read permissions on individual data cells
Creating Opportunity
Data: Server Log
Investment Services
Global investments company
> $1.5 trillion assets under
management
> $14B in revenue
~ 50K employees
IS2
Page 125 / 156
Anti-Laundering & Trade Surveillance for Investment Firm
Problem
Lags in back office system limit intraday risk analysis
• 15M transactions and 300K trades every day
• Storage limitations required archiving, limiting data availability
• Trading data not available for risk analysis until end of day, which hampers
intraday risk analytics and creates a time window of unacceptable exposure
Solution
Data lake accelerates time-to-analytics & extends retention
• Shared data repository combines more comprehensive data sets about all firm
activities, improving data transparency
• Operational data available to risk analysts earlier, same day
• Trading risk group will process more position, execution and balance data and
hold that data for five years
• Hadoop enables ingest of data from recent acquisitions despite disparate data
definitions and infrastructures
Creating Opportunity
Data: Structured
Investment
Services
Trading services for
millions of client accounts
>$16B in assets
>4,000 advisors
IS3
Page 126 / 156
Creating Opportunity
Data: Geolocation,
Clickstream, Server Log,
Sensor & Unstructured
Customer Insight from Consumer Electronics Product Usage Data
Problem
Lacked central repository for efficient data storage & analysis
• Rivers of data flow from millions of consumer electronic products
• Company lacked a platform to capture new types of data: geolocation, clickstream,
server log, sensor & unstructured
• Unable to exploit key competitive advantage: unique customer insight from troves of
big data
Solution
Efficient data storage unlocks value in company data
• Hadoop data lake permits view into how customers use products across multiple
types of data
• Lower cost of storage improves the margin for retaining data
• Powerful cluster includes many key ecosystem projects: Hive, HBase, HCatalog, Pig, Flume, Sqoop, Ambari, Oozie, Knox, Falcon, Tez and YARN
Manufacturing
Consumer electronics
>$180B in revenue
>400K employees
MF1
Page 127 / 156
Improving Efficiency
Data: Sensor
Optimizing High-Tech Manufacturing
Problem
Data scarcity for root cause analysis on product defects
• 200 million digital storage devices manufactured yearly
• Devices not passing QA scrapped at the end of the line
• >10K faulty devices returned by customers every month
• Limited data available for root cause analysis means that diagnosing problems is
highly manual (physical inspections)
• Subset of sensor data from QA testing retained 3-12 months
Solution
Data retention doubled, with 10x processing improvement
• Repository of sensor data now holds larger portion of total data
• Dashboard created 10x more quickly than before Hadoop
• Data retained for at least 24 months
• Manufacturing dashboard allows >1,000 employees to search data, with results
returned in less than 1 second
Manufacturing
Digital Storage Devices
>$15B in revenue
>85K employees
MF2
Page 128 / 156
Creating Opportunity
Data: Clickstream &
Server Log
Social Site Speeds Processing, Reduces Cost
Problem
Data growth outpaced existing Greenplum solution
• 20M monthly unique visitors, and growing
• Greenplum storage solution was slow and expensive
• Operations team challenged by data growth
• Analytics team hampered by slow processing speed
Solution
Processing speeds doubled, storage cost decreased
• Operations team saw processing speeds double compared to Greenplum
• Significant cost savings from moving data to HDP
• During this second year of support relationship, plans to move more workloads to
HDP, for better insights at a lower cost
Online Community
Online social network
>$50M in revenue
>300M members
2nd year with Hortonworks
OC1
Page 129 / 156
Creating Opportunity
Data: Clickstream,
Server Log & Social
Powering Professional Network Recommendations
Problem
Lack of a recommendation engine to promote connections
• >13M non-English-speaking members find jobs & connections
• User interactions generate semi-structured data
• Clickstream, server log and social data could feed recommendations
• Company lacked stable platform to store, refine & enrich that raw data
Solution
Hadoop recommendation engine to compete with LinkedIn
• Replaced existing CDH cluster
• New types of data feed a superior recommendation engine that enhances the
value of belonging to the community
• YARN, Tez and Stinger initiative provide near-term functionality and long-term
confidence
Online Community
Online professional network
>$90M in revenue
>13M members
OC2
Page 130 / 156
Better Romantic Matches with Data Science
Problem
Newer types of data unavailable for matchmaking algorithms
• Unable to store clickstream data and user-entered content
• Other types of data only retained for seven days
• Recommendations would help users craft attractive profiles
• High costs to store an ever growing amount of data
• Relational data platform did not fulfill their requirements
Solution
Hadoop cluster for A/B testing, device analysis, text mining
• A/B testing: consolidate email & clickstream from SQL databases (see sketch below)
• Usage patterns across devices, browsers and applications. Understand who uses
their mobile app.
• Mine user-created text (profile language and user-to-user communications) for
recommendation engine
• Longer data retention: find subtle trends with longer time window
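The A/B testing mentioned above ultimately compares conversion rates between two variants. A standard two-proportion z-test, sketched below in stdlib Python with made-up counts, is one common way to decide whether an observed difference is real.

# Two-proportion z-test sketch for A/B conversion rates (stdlib only).
# Counts below are made-up illustrative numbers.
from math import erf, sqrt

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Variant A: 1,200 signups from 48,000 emails; B: 1,350 from 47,500.
p = two_proportion_pvalue(1200, 48000, 1350, 47500)
print(f"p-value: {p:.4f}")  # small p => difference unlikely to be chance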
Creating Opportunity
Data: Server Log & ETL
Online Community
Online dating site
>300 employees
OC3
Page 131 / 156
360° View of Customer for Call Center Sales
Problem
Call center sales reps unable to recommend best product
• 2000+ product lines
• Multiple customer interaction channels (web, Salesforce, face-to-face, phone)
• Poor visibility causes sales reps to miss opportunities and customer satisfaction
suffers
Solution
Improve sales conversions with optimal product recs
• Call center reps will understand every interaction with the customer, to improve
service calls
• Natural language analysis of rep emails to customers identifies best response
language and coaching opportunities
• Recommendation engine predicts the next best product for each customer
Creating Opportunity
Data: Unstructured
Retail
IT solution and equipment
reseller
>$10B in revenue
>6K employees
RT1
Page 132 / 156
360° Customer View for Home Supply Retailer
Problem
Lack of a unified customer record across all channels
• Global distribution online, in home and across 2000+ stores
• Unable to create “golden record” for analytics on customer buying behavior
across all channels
• Data repositories on website traffic, POS transactions and in-home services
existed in isolation of each other
• Limited ability for targeted marketing to specific segments
• Data storage costs increasing
Solution
HDP delivers targeted marketing & data storage savings
• Golden record enables targeted marketing capabilities: customized coupons,
promotions and emails
• Data warehouse offload saved millions in recurring expense
• Customer team continues to find unexpected, unplanned uses for their 360
degree view of customer buying behavior
Creating Opportunity
Data: Clickstream,
Unstructured, Structured
Retail
Major home improvement
retailer
>$74B in revenue
>300K employees
>2,200 stores
RT2
Page 133 / 156
Using In-Store Location Data to Improve Cross-Sell
Problem
Retailer lacks data on how customers move through stores
• Placement of product within department stores affects sales
• Sales data is not specific enough to suggest specific changes
• Online retailers can compare what shoppers view with what they buy, but they
lack this insight in brick and mortar stores
• Result: critical decisions about store layout and inventory are made without data on shopper movement
Solution
Micro-data on shopper location enables in-store analysis similar to website
analysis: locations visited v. purchases
• Apple iBeacon app captures in-store location data for shoppers that have the
app on their iPhones
• Data streams into HDFS on how customers move through their stores, relative
to location of particular products
• Enables real-time promotions to customers w/ smart phones, based on who
they are and where they stand in the store
• Historical data across all shoppers provides insight on store design
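As one illustration of what the streamed location data supports, the sketch below computes per-department dwell time from simplified (shopper, department, timestamp) events; the event shape is a hypothetical reduction of the real iBeacon feed.

# Dwell-time sketch over simplified beacon events (stdlib only).
# Event shape (shopper, department, epoch_seconds) is a hypothetical
# simplification of the real iBeacon feed.
from collections import defaultdict
from itertools import groupby

events = [  # toy data, already ordered by shopper then time
    ("s1", "shoes", 0), ("s1", "shoes", 60), ("s1", "menswear", 300),
    ("s2", "menswear", 10), ("s2", "menswear", 400),
]

dwell = defaultdict(int)  # (shopper, department) -> seconds observed
for shopper, obs in groupby(events, key=lambda e: e[0]):
    obs = list(obs)
    for (_, dept, t0), (_, next_dept, t1) in zip(obs, obs[1:]):
        if dept == next_dept:  # stayed in the same department
            dwell[(shopper, dept)] += t1 - t0

for (shopper, dept), seconds in sorted(dwell.items()):
    print(shopper, dept, f"{seconds}s")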
Creating Opportunity
Data: Sensor &
Geolocation
Retail
Major omni-channel retailer
> $27B in revenue
>175K employees
>800 stores
RT3
Page 134 / 156
Unified Data for Online Recommendation Engine
Problem
5 data sets are fragmented, hampering product recs
• 5 major data sets: inventory data, transactional data, user behavior data, customer
profiles & log data
• Unified view needed, to recommend items to users
• Currently lack analytics dashboard across all types of data
• Storing non-transactional data on EDW is expensive
Solution
Unified data lake for increased sales and lower costs
• Unified 360° view for recommendations of similar products
• Analytics dashboard joins clickstream w/ transactional data
• Summary data stored in HBase, can be queried with web apps
• Offload some data from Teradata EDW, to lower storage costs
• Actively partnering with engineers to improve Hadoop
Creating Opportunity
Data: Structured,
Clickstream, Server Log &
Unstructured
Retail
eCommerce marketplace
>$12B in revenue
>30K employees
RT4
Page 135 / 156
Predicting Car Prices With High Confidence
Problem
Achieving 99.1% confidence in car price estimates
• Goal to provide consumers & dealers reliable car price guides
• Promise: 99.1% confidence that projected price paid will be within $20 of the
average national price paid in a given week
• As network of dealers grew, existing SQL Server data warehouse was expensive
and difficult to scale
Solution
Cost savings & data reliability at scale in a data lake
• Mission-critical price data moved to Hadoop architecture
• Server log data flows into HDP with Flume
• Analysis of this data allows analysts to further improve accuracy of estimates
Creating Opportunity
Data: Server Log & ETL
Retail
Online eCommerce service
for buying and selling cars
~300 employees
RT5
Page 136 / 156
Recommendation Engine Improves Department Store Sales
Problem
Need to create better product recommendations
• Multiple touch points: store, kiosk, web and mobile app
• Wants to promote customized promotions, coupons & recs
• Data was not integrated, making 360-view of customer behaviors impossible
Solution
Recommendations to all channels, based on data lake
• Ingest all raw data from different product lines into HDP
– Real-time data ingestion
– Structured data ingestion
• Transform raw data
– ETL processing with Pig and Hive
– Use Mahout and R to make recommendations (see sketch below)
• Recommendations will be fed to all channels
– HBase serves recommendations to web site, kiosk and mobile app
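The Mahout/R step above is, at heart, item-based collaborative filtering. Below is a tiny pure-Python co-occurrence version of that idea with toy baskets, a conceptual stand-in rather than the retailer's pipeline.

# Tiny item-co-occurrence recommender sketch (stdlib only) -- a toy
# stand-in for the Mahout/R step, not the retailer's actual pipeline.
from collections import defaultdict
from itertools import combinations

baskets = [  # toy transactions
    {"jeans", "belt"}, {"jeans", "sneakers"}, {"belt", "wallet"},
    {"jeans", "belt", "sneakers"},
]

cooc = defaultdict(int)  # (item_a, item_b) -> times bought together
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def recommend(item, k=2):
    scores = {b: n for (a, b), n in cooc.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print("customers who bought jeans also bought:", recommend("jeans"))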
Improving Efficiency
Data: ETL
Retail
Specialty department store
>$19B in revenue
>130K employees
RT6
Page 137 / 156
Faster Reports for Real Estate Agents
Problem
Accelerate reports on movers for real estate agents
• 20 million monthly visitors to family of websites
• Reports on movers not consistently generated quickly enough
• Pressure from newer market entrants
• High data storage costs reduce margins on data
Solution
More data for faster reports at a lower cost
• Improved analytical efficiency speeds report turnaround
• Data storage costs lower than before
• Improved visibility into macro trends in real estate
• Refine, explore and enrich the data better than competitors
Improving Efficiency
Data: Clickstream & ETL
Software
Operator of real estate
websites
~$200M in revenue
>1,000 employees
SW1
Page 138 / 156
Unified View Across Products, for Product Managers
Problem
Data fragmentation across products and verticals
• More than 20 product lines
• Multiple verticals: retail, financial services, healthcare, manufacturing,
communications, utilities & government
• Each product line has a separate data repository
• Unified analysis across product lines was impossible
Solution
Data consolidation for cross-product customer analysis
• Product managers will have unified data for analysis
• Raw data from different products will land in HDP
• Data will then be refined and transformed
• Real time data ingestion with Flume
• Batch data movement with Sqoop
• ETL processing with Pig and Hive
Creating Opportunity
Data: ETL
Software
Data security software,
cloud computing
~$130M in revenue
~1,100 employees
SW2
Page 139 / 156
Data Lake Protects Customers’ Enterprise Data Security
Problem
Batch processing created risk exposure, redundant systems drove costs up
• Customer protects the world’s largest organizations from data security breaches and
backs up their mission-critical data
• Process client data to identify threats and vulnerabilities
• Multiple acquisitions led to a redundant patchwork of big data analysis solutions,
including: Greenplum, Netezza and Vertica
• Six LOBs needed a common, multi-tenant data repository
• Existing batch processing caused 15-minute latency window, with exposure risk
Solution
HDP data lake consolidates infrastructure, reduces cost & speeds response times
• Consolidation into one HDP data lake represents savings of tens of millions of dollars
• Multi-tenancy with YARN permits secure access to multiple LOBs
• Real-time analysis with Apache Storm and interactive query with Apache Hive close the
15-minute risk window from earlier architecture
• Data lake also used for marketing: clickstream analysis & 360-degree customer view
Improving Efficiency
Data: Server Log,
Clickstream & ETL
Software
Global leader in data security,
storage and system
management software
>$6B in revenue
>18K employees
SW3
Page 140 / 156
Launching New Data Analysis Products
Problem
Enterprise customers have no visibility into performance
• Platforms connect 3.4 billion transactions per year
• Currently storing 90TB, growing at 20% YoY
• All divisions retain 36 months, except healthcare network: 7yrs
• Customers have no visibility into their companies’ activity on their commerce
platforms
• Client wants to add analytics services to cross-sell to existing customers and
attract new customers
Solution
HDP data lake enables launch of new information products
• Shorten data processing workloads from days to hours
• Enable ad hoc analytics queries
• Create data analysis products and services for customers of promotion, supply
chain and healthcare networks
• New product: anonymous reports that benchmark customer against competitors in
same industry
Creating Opportunity
Data: ETL
Software
Operator of intelligent
ecommerce networks
>1,400 customers
~5K employees
SW4
Page 141 / 156
Product Managers Speed Product Innovation with Hadoop
Problem
Product managers needed to analyze server logs
• 130K clients drive 780M transactions per day
• Services incorporate streams from core CRM and 3rd-party platforms like Twitter, Facebook and YouTube
• Product managers need to capture and interpret server log data to analyze new
feature adoption & performance
• Unable to process current volume using relational data stores
• Unable to retain enough data because of cost
Solution
HDP gives PMs power, reliability and liberty
• Power: Analysis of more than 30TB per month
• Reliability: Previous system broke every 2 weeks. No longer.
• Liberty: Open source solution prevents vendor lock-in
• HDP increases Product Management storage and analysis without
corresponding increase in IT spend
Creating Opportunity
Data: Server Log
Software
Sales & CRM software,
cloud computing
~$3B in revenue
~10K employees
SW5
Page 142 / 156
eCommerce Platform Uses Data Lake for Insight
Problem
New types of data difficult to store, unavailable for analysis
• Millions of payments processed every day
• Fraudsters sell fake items or extract buyer account info
• Some creditors default, resulting in losses
• Unable to store current volume using relational data stores
• Unable to retain vintage data because of RDBMS storage cost
Solution
HDP data lake accelerates multiple analysis projects
• Platform stores all new types of data: clickstream, social, sensor, geolocation,
server logs and unstructured data
• Detects and prevents theft: fraudsters stealing from members
• Assesses credit risk: server log analysis & machine learning
• Manages offers: aggregates data for advertisers
• User experience: social sentiment analysis on usability
• Site optimization: analyze clickstream for site improvements
Creating Opportunity
Data: Server Log
Software
eCommerce payments
platform
~$6B in revenue
>130M users
~13K employees
SW6
Page 143 / 156
Offloading Clickstream Data from Netezza
Problem
Netezza EDW near capacity, storing costly exhaust data
• Netezza EDW operating near capacity
• Netezza housing exhaust data not required for intended reporting and analytics,
leading to unnecessary expense
• Enterprise IT maintained redundant data stores
• Unable to store clickstream data to enrich consumer intelligence
Solution
Longer storage, lower cost & better consumer intelligence
• Hadoop will recover premium EDW cycles, currently used for transformations and data movement
• Projected costs savings of >$1M by offloading exhaust data
• Analysis of clickstream adds new dimension of customer view
• Improved service efficiency: bill processing & reporting
Improving Efficiency
Data: ETL & Clickstream
Telecom
Major telecom provider
~ $25B in revenue
> 40M customers
TC1
Page 144 / 156
Unified Household View of the Customer
Problem
Acquisitions & data explosion fragment view of customer
• Recent acquisitions and proliferation of types of data caused fragmented view of
customers
• Data exists across multiple applications & data stores
• Semi-structured data: social, sensors & networked devices
• Difficult to integrate structured, semi-structured & unstructured data sets from so
many distinct sources
Solution
HDP data lake delivers 360° unified household View
• Stable environment for exploring and enriching the data
• Store all of the data and retain it for longer
• Parse on demand: no need to pre-parse data before loading
• Analysis on demand: allows analysts to explore raw data and find unexpected
truths in the data
Creating Opportunity
Data: ETL, Social,
Sensor & Clickstream
Telecom
Major telecom provider,
offering data networks &
services
> $100B in revenue
> 200K employees
TC2
Page 145 / 156
Call Record Analysis for Improved Cell Service
Problem
System receives millions of call detail records per second
• System enables proactive management of phone call quality
• Call detail records (CDRs) are the raw data used for analysis
• Millions of CDRs stream in every second
• Storage is expensive & ingest rates are increasing 20% YoY
• 24-hour data retention not sufficient to discover long-term trends
Solution
Longer storage & rich analysis improve customer service
• HDP’s 10:1 compression allows affordable 6-month retention (see sketch below)
• Improved forensics on instances of poor call quality drive:
– Informed decisions on expansion of transmission infrastructure
– Predictive analytics on when to repair/replace equipment
• Access to more data helps service reps solve customer issues in near real-time
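The retention claim above can be sanity-checked with back-of-the-envelope arithmetic. The figures below (ingest rate, record size) are assumed for illustration; only the 10:1 compression ratio is taken from the slide.

# Back-of-the-envelope retention math (assumed figures, for illustration).
CDRS_PER_SECOND = 2_000_000   # "millions of CDRs every second" (assumed 2M)
BYTES_PER_CDR = 200           # assumed average record size
COMPRESSION = 10              # 10:1 compression cited on the slide
SECONDS_PER_DAY = 86_400

raw_tb_per_day = CDRS_PER_SECOND * BYTES_PER_CDR * SECONDS_PER_DAY / 1e12
compressed_tb_per_day = raw_tb_per_day / COMPRESSION
print(f"raw ingest:        {raw_tb_per_day:,.1f} TB/day")
print(f"after compression: {compressed_tb_per_day:,.1f} TB/day")
print(f"6-month footprint: {compressed_tb_per_day * 182:,.0f} TB")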
Creating Opportunity
Data: Sensor
Telecom
Major telecom provider,
offering data networks &
services
> $100B in revenue
> 200K employees
TC3
Page 146 / 156
ETL: 100x the Data, 12x Longer, $3M Saved
Problem
Changing business model required new data architecture
• Started in the 1990s as a neutral intermediary for telco networks
• Network management market is mature
• CEO challenged company to build business for data analysis and information
services, related to telecom data
• Netezza data capacity limited to 20TB
• Only stored 1% of total dataset, retained for only 60 days
Solution
More data, stored longer, with $3 million in cost savings
• Avoided $3M annual expense, compared to Netezza
• Now storing 100% of data, retained for two years
• Larger data set supports new, accurate information products
• Improved access to data for more employees drives new innovation across the
enterprise
Creating Opportunity
Data: ETL
Telecom
Telco information and
analytics vendor
$800M in revenue
~2,500 employees
TC4
Page 147 / 156
Searchable ETL for CDRs & Customer Data
Problem
Data storage costs limit the amount and types of data available for analysis
• Teradata and Vertica used for data storage, ideal for certain data workloads, but
unsuited for less structured types of data
• Limited retention of call detail records (CDRs)
• Limited analysis across call logs, CRM records & customer acquisition models
Solution
Data lake: ETL, data exploration & “next product to buy” (NPTB) recommendations
• Partners Teradata, HP and Impetus helped craft a solution
• CDRs now retained for longer, improving visibility & analysis
• Customer retention data can be correlated to service quality
• Plan to integrate search for real-time NPTB recommendations
• Improved customer acquisition and retention
Creating Opportunity
Data: Structured, Server Log
& Geo-location
Telecom
Telco vendor specializing in
VOIP
> $800M in revenue
> 2M subscribers
~ 1,000 employees
TC5
Page 148 / 156
Better Service to Premium Customers, for Less
Problem
Inability to identify base stations serving premium customers
• CRM system and network logs were in isolated data silos
• Company unable to analyze base station usage by premium customers, to prioritize
investments
• Info gap prevented optimal ROI on infrastructure investments
Solution
HDP joins structured CRM & unstructured network data at scale
• Partnered with Datameer and HP to deliver a unified solution
• Joins network data on utilization of base stations with CRM data on the value of
customers using those stations most often
• Optimizes service to the most valuable customers
• Efficient resource allocation reduces overall cost to maintain network infrastructure
Improving Efficiency
Data: Structured &
Server Log
Telecom
Major European telco
> $800M in revenue
> 300M customers
> 100K employees
TC6
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 

  • 9. Uncaptured & Unanalyzed Data – A Missed Opportunity. Every organization has data lying around: not yet captured, poorly captured, or captured but never analyzed. That data may contain hidden gems for improving decision making; leaving it untouched is a missed opportunity.
  • 10. Big Data Processing Technologies Are NOT Big Data. It is a common misconception that adopting Hadoop, Spark, etc. means adopting Big Data. It does not. Big Data is the massive amount of data you have collected, or have the opportunity to collect, but are unable to collect and process due to computing or cost limitations. Adopting Hadoop, Spark, or any other Big Data technology without a strong data collection and analysis strategy will not deliver the benefits you are after.
  • 11. How Big IS Big? A common question: how big should my data be for it to be considered Big Data? The answer is subjective, depending on the organization, data, analytical processes, and outputs you are dealing with. But ask yourself two questions. First: is your current data architecture/infrastructure able to collect the data and produce the outputs you require, in a timely manner? Second: do you plan to collect more and more data – terabytes and beyond – with the goal of analyzing it very rapidly, while archiving the raw data for a long period of time? If your architecture keeps up and you have no such growth plans, you are likely not dealing with a Big Data problem. If your architecture cannot keep up, or you do have such growth plans, you are likely dealing with a Big Data problem – or simply an optimization problem.
  • 12. Data – The New Oil
  • 13. Basic Concepts of Petroleum Mining. Petroleum reserves exist in wells and shale. Mining equipment extracts petroleum from wells. Petroleum pipelines transport petroleum to silos and refineries. Silos store petroleum before it is processed. Refineries refine petroleum to create petroleum-based products for consumers. Petroleum engineers design, construct, and maintain the petroleum mining equipment, pipelines, silos and refineries. Petroleum scientists research petroleum to create new products and applications using its components.
  • 14. Data Mining & Analytics. Data exists in the environment. Sensors and data collection software extract data from the environment. Ingestion/ETL data pipelines bring raw data into a central data repository. Data repositories/databases store data for analytics purposes. Data refineries/processing software process data to extract analytical results for data consumers. Data engineers design, construct, and maintain the data mining, pipeline, repository and processing infrastructure. Data scientists research data to create new products and applications using the analytical results from it. We are your data engineers!
  • 15. Handling Big Data: Data Science vs Data Engineering. ● Data Science / Data Analysis – extract value from data – descriptive/predictive/prescriptive analytics – unstructured data analysis – domain expertise – skills: statistics, R, Python, Spark ML, Weka, Scala, etc. ● Data Engineering – infrastructure, technologies and expertise to handle the volume, velocity and variety of data – data pipelining, ingestion, scheduling and pre-preparation – job/query optimization, parallel processing, data processing automation – dashboards and data applications – skills: Hadoop, YARN, NiFi, NoSQL, Python, MapReduce, Java, etc.
  • 16. Profile of a Data Scientist. Math & Statistics: machine learning; statistical modeling; experimental design; Bayesian inference; supervised learning (decision trees, random forest, logistic regression); unsupervised learning (clustering, dimensionality reduction); optimization (gradient descent and variants). Domain Knowledge & Soft Skills: passionate about the business; curious about data; influence without authority; hacker mindset; problem solver; strategic, proactive, creative, innovative and collaborative. Programming & Database: computer science fundamentals; a scripting language, e.g. Python; a statistical computing package, e.g. R; databases, SQL and NoSQL; relational algebra; parallel databases and parallel query processing; MapReduce concepts; Hadoop and Hive/Pig; custom reducers; experience with XaaS like AWS. Communication & Visualization: able to engage with senior management; storytelling skills; visual art & design; R packages like ggplot or lattice; knowledge of visualization tools, e.g. Flare, D3.js, Tableau.
  • 17. Profile of a Data Engineer. The same four skill quadrants apply as for the data scientist above: Math & Statistics, Domain Knowledge & Soft Skills, Programming & Database, and Communication & Visualization.
  • 19. Computing Before the 'Big Data' Era. Applications only selectively collected the data necessary for their core functionality, discarding the rest. Due to technological limitations, applications mostly stored recent data and processed it to generate relatively simple reports; old data was regularly purged for performance and cost reasons. Complex, intensive processing over massive amounts of data required big, expensive mainframes or supercomputers. Reports and analytical outputs were limited to low-frequency processing (e.g. daily, monthly) due to computing limitations.
  • 20. Google – Pioneer of Big Data. All public websites on the internet → Process & Index → Search service (a generalization / high-level view, not the actual architecture). Google spiders/Googlebots crawl the internet, capturing every web page they can and bringing the data into Google's internal storage and processing infrastructure. Google's backend processing engines regularly process and update the website index, rank websites using the proprietary Google PageRank algorithm, and then provide a fast, searchable index of the whole internet to end users.
  • 21. Google's Solution (Pre-2003): GFS + MapReduce. Web page data is collected and stored in a distributed datastore (the Google File System) across lots of commodity hardware. The MapReduce framework analyzes, transforms and ranks web pages en masse in a periodic manner, before they are indexed by the Google search engine cluster to serve searches.
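To make the map/reduce pattern concrete, here is a minimal, self-contained Python sketch of a word count over a couple of stand-in "pages". It is illustrative only – the input data and function names are made up, and this is not Google's or Hadoop's actual code; in a real cluster the map, shuffle/sort and reduce phases run distributed across many machines, with the framework handling the grouping step.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    # Emit an intermediate (word, 1) pair for every word on the page.
    for word in text.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Sum all partial counts emitted for one word.
    return (word, sum(counts))

pages = {"page1": "big data big insights", "page2": "big data pipelines"}

# Map: run the mapper over every page.
pairs = [kv for doc, text in pages.items() for kv in map_phase(doc, text)]
# Shuffle/sort: group intermediate pairs by key (done by the framework).
pairs.sort(key=itemgetter(0))
groups = groupby(pairs, key=itemgetter(0))
# Reduce: aggregate each group into a final (word, total) pair.
result = [reduce_phase(word, (c for _, c in grp)) for word, grp in groups]
print(result)  # [('big', 3), ('data', 2), ('insights', 1), ('pipelines', 1)]
```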
  • 22. Nutch Project (2002) – An Attempt to Create an Open Source Web Search Engine Infrastructure. The Nutch project was attempting to build a full-scale web search engine, from crawler to indexing. Back then, however, it only had a web crawler, and had yet to solve the storage and processing problem for the data it gathered.
  • 23. Google Released the GFS and MapReduce Papers – 2003-2004. Google released the GFS (late 2003) and MapReduce (late 2004) papers to the community, describing the architecture it used to store and manage distributed data at Google, and how that data was processed in a distributed manner.
  • 24. Nutch Distributed Filesystem + Nutch MapReduce (2004-2005). With the goal of creating a search engine, the Nutch project picked up both the GFS and MapReduce papers and developed its own implementations of the two technologies: the Nutch Distributed File System (NDFS) and Nutch MapReduce.
  • 25. Hadoop Project Branched Out from the Nutch Project. In 2006, the Hadoop project split out from Nutch to provide a specialized, affordable solution for storing and processing massive amounts of data on commodity hardware (HDFS + MapReduce). The open source nature of Hadoop helped spark the move towards Big Data processing across the whole industry by making massive data processing affordable. Now everybody can compute over massive amounts of data!
  • 26. The MapReduce Paper Also Inspired Other Technologies Following Its Architecture. Some existing database technologies, such as MongoDB and some PostgreSQL flavors, adopt MapReduce internally for computing over distributed data in their clusters. Various programming languages also have libraries that implement MapReduce as a distributed computing algorithm, not necessarily on Hadoop.
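As a hedged illustration of that point, the same pattern can be written with nothing but the Python standard library, parallelizing the map step across local CPU cores instead of a Hadoop cluster; the input chunks below are made up.

```python
from functools import reduce
from multiprocessing import Pool

def mapper(chunk):
    # Map step: count words within one chunk of the input.
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge(a, b):
    # Reduce step: merge two partial count dictionaries.
    for word, n in b.items():
        a[word] = a.get(word, 0) + n
    return a

if __name__ == "__main__":
    chunks = ["big data big insights", "big data pipelines"]  # stand-in input
    with Pool() as pool:
        partials = pool.map(mapper, chunks)  # map phase, in parallel
    totals = reduce(merge, partials, {})     # reduce phase
    print(totals)  # {'big': 3, 'data': 2, 'insights': 1, 'pipelines': 1}
```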
  • 27. Impact of Big Data Adoption on Data Analytics Practice
  • 28. Stages of Organizational Data Growth (source: Teradata)
  • 29. Six Sigma – Data-Driven Decision Making, supported by data from the data analytics practice
  • 30. Business Intelligence vs Business Analytics.
What does it do? Business Intelligence reports on what happened in the past or what is happening now, in current time. Business Analytics investigates why it happened and predicts what may happen in the future.
How is it achieved? BI: basic querying and reporting; OLAP cubes, slice and dice, drill-down; interactive display options – dashboards, scorecards, charts, graphs, alerts. BA: applying statistical and mathematical techniques; identifying relationships between key data variables; revealing hidden patterns in data.
What does your business gain? BI: dashboards with "how are we doing" information; standard reports and preset KPIs; alert mechanisms when something goes wrong. BA: answers to "what do we do next?"; proactive, planned solutions for unknown circumstances; the ability to adapt and respond to changes and challenges.
  • 31. Components of Data Analytics. Four stages, each answering a different question: Descriptive – what happened? Diagnostic – why did it happen? Predictive – what will happen? Prescriptive – what should we do next? Supporting techniques include OLAP, statistics, data mining, machine learning, deep learning, artificial intelligence and knowledge bases. Data requirements increase as you move from descriptive towards prescriptive.
  • 32. Cost of Data Analytics. ● As we go up the chain from descriptive to prescriptive, more data must be analyzed to compute the outputs, and data scale and computing power needs increase accordingly. ● Historically, only those who could afford supercomputers or large mainframes could apply advanced predictive and prescriptive analytics to their data assets; for everyone else, computation took so long that it was impractical for business use. ● Big Data adoption removed several barriers: it became easy for programmers to write computation algorithms that span hundreds of commodity machines; existing algorithms that used to run only on a single computer were ported over to distributed computing; cloud-based architectures allow usage-based costing with minimal to no upfront cost; Big Data on open source technologies removes upfront software cost for the technically savvy. Advanced analytics became affordable for businesses.
  • 34. Simpler Data Collection = More Data Collection = Bigger Data. ● Traditional flow (ETL): an ETL flow must be developed to extract and transform raw data before loading it into the central data management platform. The inherent cost of designing and developing an ETL flow and data model prevents data from being collected early, and enhancing the data model with new sources means changing ETL jobs, which can become unmaintainable in the long run. ● Flow in Big Data practice (ELT): instead of waiting for an ETL flow and destination data model to be developed, raw data is brought immediately into the central data management platform through simpler ingestion jobs – the data collection barrier is removed. Analytics can be done directly on the raw data, or a transformation job can be executed post-ingestion to prepare the data model. ELT does come at the cost of requiring more data storage, but hardware is usually cheaper than manpower.
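A minimal sketch of the two flows, assuming pandas is available; the file names, directories and columns are hypothetical stand-ins.

```python
import os
import shutil
import pandas as pd

os.makedirs("warehouse", exist_ok=True)
os.makedirs("datalake/raw", exist_ok=True)

# Stand-in extract from a source system.
pd.DataFrame({"amount": [100.0, None, 250.0],
              "fx_rate": [4.7, 4.7, 4.7]}).to_csv("sales_raw.csv", index=False)

# ETL: transform first -- only the modeled, cleaned data reaches the platform.
raw = pd.read_csv("sales_raw.csv")
model = (raw.dropna(subset=["amount"])
            .assign(amount_myr=lambda d: d["amount"] * d["fx_rate"]))
model.to_csv("warehouse/sales_model.csv", index=False)

# ELT: land the raw file untouched first; transform later in the platform.
shutil.copy("sales_raw.csv", "datalake/raw/sales_raw.csv")
# ...a post-ingestion job can now build (and rebuild) models from the
# preserved raw copy at any time, at the cost of extra storage.
```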
  • 35. ELT vs ETL.
ELT advantages: no need for a separate transformation engine – the work is done by the target system itself; data transformation and loading happen in parallel, so less time and fewer resources are spent; ELT works with high-end data engines such as Hadoop clusters, cloud platforms or data appliances, which gives it additional performance and security; the processing capability of the data warehousing infrastructure reduces the time data spends in transit and makes the system more cost-effective.
ELT disadvantages: the specifics of ELT development vary by platform – Hadoop clusters, for example, work by breaking a problem into smaller chunks and distributing those chunks across a large number of machines, and some problems split easily while others are much harder; developers need to be aware of the nature of the system they use to perform transformations, since some systems can handle nearly any transformation while others lack the resources, requiring careful planning and design.
ETL advantages: a single-view interface to integrate heterogeneous data; the ability to join data both at the source and at the integration server, with the option to apply any business rule from within a single interface; a common data infrastructure for data movement and data quality work; a parallel processing engine providing exceptional performance and scalability.
ETL disadvantages: migration from server to enterprise edition can require vast time and resources due to the innumerable architectural differences between editions; no automated error handling or recovery mechanism; expensive as a solution for small or midsized companies.
  • 36. Evolution of Data Management Architectures
  • 37. File Based. The most basic data management architecture: the application reads and writes data files on disk, and reports are generated by reading data back from those files.
  • 38. Database. The most common architecture for applications: separate application and database services/nodes. The database abstracts the complexity and optimizes the performance of managing file-based storage; the application inserts the data it gathers and queries data to create outputs.
  • 39. Separated OLTP / OLAP Databases. The natural path for reducing database workload: separate the infrastructure for operational application use from the infrastructure for analytical reporting use. A replica database syncs with the source database via ETL/sync jobs, and analytical report queries are executed on the replica, not the source.
  • 40. Data Warehouse. When analytical reports must be generated from data coming from many sources, a central data warehouse provides the infrastructure for cross-system analytical queries. Data is moved into the warehouse through an extract-transform-load process that normalizes datasets and makes cross-system data joins possible. Data marts are usually created on top, containing more human-understandable, domain-specific data structures that make it easy for non-technical users to analyze data in the warehouse.
  • 41. Modern Data Architecture (Data Warehouse + Data Lake). Organizations with more advanced analytical practices want to collect not just data from operational databases, but also datasets in various formats from the other sources and applications around them. A data lake provides a simpler architecture for gathering these datasets for future analytical use, along with a highly scalable platform for computing over massive data. A data lake is usually used together with an existing data warehouse, to leverage the warehouse's strength in structured data processing.
  • 43. Lambda Architecture. Incoming data streams are ingested into two paths: a batch layer, which stores all data and periodically precomputes aggregated views, and a speed layer, which preprocesses records off a message queue and maintains real-time aggregated views. A serving layer answers queries from the combination of the batch and real-time aggregated views.
  • 44. Batch Processing. Characteristics: scheduled or interactive processing; bulk activity; works over historical data or a subset of it; processing takes from seconds to hours; primarily analytical and reporting workloads; results are used by automated systems or users. Strengths: able to access and compute over all data for analysis; relatively simple to implement; a familiar setup, as most systems are batch. Weaknesses: not suitable for frequent queries if the data is very large; requires data flow optimization to precompute results.
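A minimal PySpark batch-job sketch of this pattern (assuming a Spark installation; the input path and column names are hypothetical): read a bulk of historical data, aggregate it, and write a report on a schedule.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-batch-report").getOrCreate()

# Bulk-read historical data from the data lake.
events = spark.read.json("hdfs:///datalake/raw/events/")

# Aggregate everything in one scheduled pass.
report = (events.groupBy("customer_id")
                .agg(F.count(F.lit(1)).alias("events"),
                     F.sum("amount").alias("total_amount")))

report.write.mode("overwrite").parquet("hdfs:///datalake/reports/daily/")
spark.stop()
```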
  • 45. Real-Time Processing. Characteristics: data is processed as it arrives; deals primarily with the most recent data; processing a record takes milliseconds to tens of seconds; supports complex event processing and notifications; results are used by automated systems. Strengths: lower load over time, because data is processed as it comes in throughout the day rather than in bulk operations; analyzed reports update immediately throughout the day, allowing faster decision making. Weaknesses: more difficult to develop, as it requires writing a real-time data pipeline application.
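A minimal PySpark Structured Streaming sketch of the same aggregation done continuously (assuming Spark with the Kafka connector package; the broker address and topic are hypothetical): records are aggregated as they arrive rather than in a nightly bulk job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("speed-layer").getOrCreate()

# Consume records continuously from a message queue.
stream = (spark.readStream.format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "events")
               .load())

# Maintain a running count per key -- a real-time aggregated view.
counts = (stream.selectExpr("CAST(key AS STRING) AS k")
                .groupBy("k")
                .count())

query = (counts.writeStream
               .outputMode("update")   # emit only keys updated in each batch
               .format("console")
               .start())
query.awaitTermination()
```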
  • 46. Data Collection: An important component of data analytics
  • 47. Big Data Titans Are Data Collection Titans. When it comes to data collection, these companies collect whatever they can from every point in their business operations.
  • 48. Data Analytics Is Dependent on Input Data. The outputs you can produce are bounded by the inputs you collect and store.
  • 49. Various Sources of Data Collection: clickstream, logs, sensors, web / social media, RDBMS – from applications, devices, the internet, mobile, and databases.
  • 50. Two Strategies of Data Collection. ● Business question driven: data is collected based on business needs; clear scope, goals and deliverables; manageable size; but a long turnaround before data can be turned into actionable insights – you have to wait for data growth, and advanced analytics is not possible until the data has grown large enough. ● Collect first, analyze later: data is collected as it is discovered or required; builds data assets before doing data analytics; requires an initial investment in data storage and risks collecting useless data; business questions are asked against the available data assets – rich data assets allow advanced analytics with a shorter turnaround time.
  • 52. Data Is Everywhere. The Internet of Things is about a connected world, where everything is connected to the internet – everything is an input data source, and everything is an output display. Sensors are everywhere: GPS, temperature, humidity, luminosity, audio, video, and more. IoT brings massive amounts of data – Big Data.
  • 53. Typical IoT Application Architecture. IoT sensors collect data and send it to the application backend in the cloud. An army of servers works together to store and process the data for the IoT application. Analytical results from the collected data are then provided to customers and users, delivering value.
  • 54. Tools and Technologies for Big Data
  • 55. Ecosystem: data collection, data pipelining, data processing, data storage, data serving, data visualization.
  • 56. Data Collection. The starting point of accumulating data assets: measuring or capturing environment variables or state as digitized data. Tools and equipment include, but are not limited to: any programming language (write out application state as logs); web scrapers (Scrapy/Portia, FMiner, Outwit, Mozenda, Capterra); sensor equipment (Raspberry Pi, Arduino, various sensor circuits, SCADA); RDBMS extractors (Sqoop, various ETL tools, custom scripts); mobile devices (modern smartphones have a rich array of sensors). A minimal collector is sketched below.
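The simplest possible collector is a script that samples a reading and appends it as one JSON log line per record; read_temperature() below is a hypothetical stand-in for a real sensor driver.

```python
import json
import random
import time
from datetime import datetime, timezone

def read_temperature():
    # Hypothetical stand-in for a real sensor driver (Raspberry Pi, SCADA...).
    return round(25 + random.uniform(-2, 2), 2)

with open("sensor_log.jsonl", "a") as log:
    for _ in range(5):
        record = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "sensor": "temp-01",
            "celsius": read_temperature(),
        }
        log.write(json.dumps(record) + "\n")  # one record per line
        time.sleep(1)
```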
  • 57. Data Pipelining. Moves data from sources to repositories, coordinating and scheduling data extraction and pre-processing workflows while the data is in flight to the repositories. Tools include, but are not limited to: workflow programming libraries (Airflow, Luigi, Oozie, etc.); traditional ETL tools (Talend, Pentaho, Oracle Data Integration, etc.); stream data pipeline tools (Apache NiFi, NodeRED, StreamSets, Storm, Kafka Connect, etc.). A scheduling sketch follows.
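A hedged Airflow sketch of the coordinate-and-schedule role (operator import paths and parameters vary across Airflow versions; the DAG id and callables here are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw files from the source system")

def land():
    print("write raw files into the data lake")

with DAG(dag_id="nightly_ingestion",
         start_date=datetime(2017, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_land = PythonOperator(task_id="land", python_callable=land)
    t_extract >> t_land  # run extraction before landing
```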
  • 58. Data Storage. Stores and archives data for short- and long-term use, and works together with the processing infrastructure to extract insights from data by providing optimized data structures. Tools include, but are not limited to: software-defined distributed storage (HDFS, GlusterFS, Ceph, ZFS, etc.); databases (PostgreSQL, Oracle, MSSQL, etc.); NoSQL datastores (MongoDB, Elasticsearch, Solr, Neo4j, HBase, Redis, etc.); message queues (Kafka, RabbitMQ, Redis, etc.). See the landing sketch below.
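A hedged sketch of landing a file into HDFS with the community `hdfs` (WebHDFS) client library; the NameNode URL, user and paths are hypothetical:

```python
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="etl")

# Land a raw record into distributed storage for long-term retention.
with client.write("/datalake/raw/sensor_log.jsonl", encoding="utf-8",
                  overwrite=True) as writer:
    writer.write('{"ts": "2017-01-01T00:00:00Z", "celsius": 25.1}\n')

print(client.list("/datalake/raw"))  # verify the file landed
```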
  • 59. Data Processing. Processes and computes data to extract value and insights, in batch or real time, ideally in a distributed manner, and provides algorithms for complex computations. Tools include, but are not limited to: any programming language, especially R, Scala and Python; distributed batch processing engines (MapReduce, Tez, Hive, Pig, Spark); distributed stream processing engines (Storm, Celery, StreamParse); traditional ETL tools (Talend, Pentaho, Oracle Data Integration).
  • 60. Data Serving. Serves processed data for high-performance analytical queries, using highly optimized data structures for purpose-specific queries. Tools include, but are not limited to: high-performance OLAP (Druid, Kylin); graph data stores (Neo4j, ArangoDB); search engines (Elasticsearch, Solr); time series databases (Graphite, InfluxDB, OpenTSDB, Prometheus). A query sketch follows.
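A hedged sketch of serving a purpose-specific query from a search engine with the elasticsearch-py client (the host, index name and field are hypothetical, and the request style varies by client version; the query shown is the standard match query DSL):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Serve an analytical query from an index optimized for search.
response = es.search(index="weblogs", body={
    "query": {"match": {"status": 500}},
    "size": 10,
})
for hit in response["hits"]["hits"]:
    print(hit["_source"])
```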
  • 61. Data Visualization. Displays data summaries and reports as visual diagrams and charts, and supports visual data discovery and exploration. Tools include: traditional BI/reporting tools (Pentaho, Jasper, SAS, SpagoBI, MicroStrategy, etc.); real-time dashboarding (Grafana, Kibana); visualization libraries (D3.js, DC.js, Shiny, Bokeh, etc.); visualization platforms (Tableau, Redash, Superset). A small charting sketch follows.
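A small charting sketch with Bokeh, one of the libraries listed above; the data is made up:

```python
from bokeh.plotting import figure, output_file, show

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
events = [120, 340, 290, 410, 380]

output_file("daily_events.html")  # renders to a standalone HTML page
p = figure(x_range=days, title="Events per day",
           x_axis_label="day", y_axis_label="events")
p.vbar(x=days, top=events, width=0.8)
show(p)  # opens the chart in a browser
```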
  • 62. Understanding Open Source License & Consumption Model
  • 63. Modern Big Data Technologies Are Driven by the Open Source Community
  • 64. Open Source Software Definition. Software licensed under a license that guarantees the following rights: – Free Redistribution ● The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale. – Source Code ● The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed. – Derived Works ● The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software. – Integrity of The Author's Source Code ● The license may restrict source-code from being distributed in modified form only if the license allows the distribution of "patch files" with the source code for the purpose of modifying the program at build time. The license must explicitly permit distribution of software built from modified source code. The license may require derived works to carry a different name or version number from the original software.
  • 65. www.abyres.net (c) 2017 Abyres Enterprise Technologies Sdn Bhd Page 65 / 156 OpenSource Software Definition – No Discrimination Against Persons or Groups ● The license must not discriminate against any person or group of persons. – No Discrimination Against Fields of Endeavor ● The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research. – Distribution of License ● The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties. – License Must Not Be Specific to a Product ● The rights attached to the program must not depend on the program's being part of a particular software distribution. If the program is extracted from that distribution and used or distributed within the terms of the program's license, all parties to whom the program is redistributed should have the same rights as those that are granted in conjunction with the original software distribution. – License Must Not Restrict Other Software ● The license must not place restrictions on other software that is distributed along with the licensed software. For example, the license must not insist that all other programs distributed on the same medium must be open-source software. – License Must Be Technology-Neutral ● No provision of the license may be predicated on any individual technology or style of interface.
  • 66. Open Source Does Not Mean No Copyright. Open source software is copyrighted, not public domain. The author retains the copyright and intellectual property, but chooses to grant licensees additional rights that are normally not granted under a proprietary license. Any user of the software automatically becomes a licensee the moment they acquire a copy. Open source authors usually reuse legal license documents that already exist in the open source community as the license for their software. Should you not comply with the terms and conditions in the license document, the author has the right to enforce the license.
  • 67. Types of Open Source Licenses. Permissive: most flexible; derivative works are not required to be open source or to use the same license; e.g. MIT, BSD. Weak copyleft: some parts of derivative works are required to use the same license; usually used for libraries – modifications to the library itself must be released under the same license, but projects importing the library need not be; e.g. LGPL. Strong copyleft: strict enforcement of the same license for any derivative work; all projects importing libraries licensed this way must also be released under the same license as the original work; e.g. GPL, AGPL.
  • 68. Consumption Model. Common misconceptions about open source in the enterprise: that it is free; that it is not licensed; that it comes without support; that it is unstable and changes too fast. Open source is a software development model, but not exactly a software consumption model – the consumption model is more or less similar to other software. Software can be free, but human time is not. Software is developed in public, possibly with community contributions and many frequent improvements. An open source software distribution company takes a snapshot of the codebase, stabilizes and integrates it, creates support, training and warranty models, and productizes the software. Enterprise customers buy the productized software and receive support, training and warranty from the distribution company, and services from SIs. An ecosystem of system integrators, ISVs and trainers provides professional services and added value around the productized software.
  • 69. Upstream Software vs Downstream Enterprise Product. Upstream: rapidly changing; latest and greatest features; can be unstable; no warranty, no support, or minimally supported; most of the time free of capital cost. Enterprise: fewer changes over a short period of time; tried and tested features; generally more stable; comes with warranty, support SLAs, training and certifications; charged for support and warranty subscriptions and professional services.
  • 70. Do I Have to Use an Enterprise Product? Do I have to use the enterprise edition of an open source software? Are you going to use it as a hobby or professionally? As a hobby – not necessary. Professionally, is it for R&D or for production? For R&D – not necessary. For production: do you have regulations against running software without warranty or internal expertise? If yes – required. If no: do you have the budget? If yes – recommended; if no – not necessary.
  • 71. A Framework For Organizational Data Journey
  • 72. Big Data Transformation Journey
  • 73. Stage 1: Data Discovery on Active Archive. ● Initial starting infrastructure for proof of value: 2 masters and 3 workers for batch/interactive processing, plus 1 node for stream processing. ● Select several datasets and ingest both current data and the historical archive, along with related tables, making them available for analyzing patterns over a long historical context – e.g. Touch 'n Go transactions. ● Familiarize the team with the technologies and processes involved in Big Data. ● Create reports/dashboards detailing discoveries from analysis of the historical archive. ● Time frame: 6-12 months.
  • 74. Stage 2: Data Lake. ● Medium-scale cluster for a central data lake: 3 masters and 10-15 workers for batch/interactive processing, plus 2 nodes for stream processing. ● On-board more datasets from various internal sources into the data lake to get a 360-degree view of the organization – e.g. CRM, ERP, website logs, device logs. ● Develop and launch reports and dashboards on cross-dataset relationships and patterns – e.g. a 360-degree view of the customer. ● Time frame: 6-24 months.
  • 75. Stage 3: Advanced Analytics. ● Large-scale cluster for complex computation: 4 masters and 20+ workers for batch/interactive processing, plus 4+ nodes for stream processing. ● On-board external data sources for enrichment against internal datasets – e.g. social media, web scrapers, IoT sensors. ● Pursue aggressive data collection and data mining as a strategic direction and asset. ● Identify repetitive patterns, create models to predict them, and leverage those models in AI-powered applications (a minimal modeling sketch follows). ● Time frame: 12-24 months.
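A minimal modeling sketch of the "identify patterns, predict them" step using scikit-learn; the features and labels below are made-up stand-ins, not real data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical features (e.g. usage metrics) and a binary outcome to predict.
X = [[1, 20], [2, 18], [8, 3], [9, 2], [7, 4], [1, 22], [8, 1], [2, 19]]
y = [0, 0, 1, 1, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print(model.predict(X_test))        # predictions for unseen records
print(model.score(X_test, y_test))  # simple accuracy check
```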
  • 76. Stage 4: Continuous Improvement. Continuous data-driven transformation and innovation.
  • 78. The Data Journey to a Golden Batch
  • 79. Case Study: Merck's Journey. Improving life sciences manufacturing yields presents a complex data discovery challenge. Vaccine manufacturing requires precise control of complex fermentation processes. Two batches of a vaccine, produced using an identical manufacturing process, can exhibit significant yield variances. Batches that fail quality standards can cost $1 million each. Data for one vaccine was stored across 16 different systems, and high storage costs limited the length of data retention.
  • 80. Merck's Journey: The Golden Batch. Journey steps (renovate to innovate): sensor data storage, scientific search, vaccine yield optimization, epidemiology. Results: combined 10 years of data on one vaccine – 1 billion records; 5.5 million batch comparisons; a first-year yield boost of 40K more doses; $10M profit impact; McKinsey: 50% yield increase.
  • 81. The Data Journey to Safe Roads
  • 82. Case Study: Progressive's Journey. Progressive wanted to ingest IoT data to predict risk for its usage-based insurance product. Progressive Snapshot offers usage-based insurance through an in-car sensor that transmits IoT driving data. Sensors collect up to six months of data from drivers, and the data is archived for years per regulatory requirements. Progressive's existing systems were not scaling efficiently: it took 5-7 days to transform only 25% of the available UBI data.
  • 83. Progressive's Journey: Rewarding Safer Drivers and Improving Traffic Safety. Journey steps (renovate to innovate): sensor data ingest, web log analysis, individual driving histories, claims notes mining, usage-based insurance (UBI), online ad placement, safe roads. Results: Snapshot plug-in devices capture driving detail; Progressive stores more than 10 billion miles driven; through a web app, customers can review their own driving detail and improve their safety; Snapshot and usage-based insurance drove $2.6 billion in 2014 Progressive premiums.
  • 84. The Data Journey to Better Health
  • 85. Case Study: Mercy's Journey. Mercy Medical System sought a data lake for a single view of its patients – "one patient, one record". The existing platform impeded the goal of enriching Epic data for 1 million patients across 35 hospitals and 500 clinics. Moving Epic EMR data to the Clarity EDW took 24 hours and was "never going to enable real-time analytics"; with HDP it now takes 3-5 minutes. Improved billing processes resulted in $1M additional annual revenue from newly documented secondary diagnoses and care.
  • 86. Mercy's Journey: Better Health Through Data. Journey steps (renovate to innovate): Epic EMR replication, OPEX efficiency, Epic enrichment, device data ingest, lab notes archive, privacy database, billing, vital sign monitoring, single patient record, medical decision support, preventive care, better health. Results: searches of free-text lab notes speed researcher insight from "never" to "seconds"; ingest of ICU vital signs increased by 900X, letting clinicians respond more quickly; Mercy is building real-time tools to support surgical decisions and preventive care.
  • 87. Webtrends: The Data Journey Towards Personalized Online Ads
  • 88. Webtrends' Journey. Massive volumes of weblogs fueled Webtrends' growth – and also its skyrocketing storage costs. Webtrends provides digital marketing solutions for more than 2,000 companies in 60 countries, processing 13 billion daily online events. Data used to be processed in relational databases and stored on large NAS appliances, which were not economical at scale. Processing occurred on-premises, without cloud-based capabilities. Diseconomies of scale hampered the company's objective of helping its customers predict optimal online ad placement.
  • 89. Webtrends' Journey: Petabytes of Weblogs Analyzed with Spark at Scale. Journey steps (renovate to innovate): SQL Server offload, web log analysis, per-customer click path, behavioral segmentation, LCV analysis, ad click predictions, personalized online ads. Results: data streams in from a vast array of desktop and mobile devices; 13 billion daily events are collected, in milliseconds per event; no data cleansing is necessary prior to analysis with Apache Spark; 2 clusters were consolidated into 1 YARN-based HDP cluster; launched a new product, Webtrends Explore™, powered by HDP. "We're able to…look at this data set and process it and do predictions, behavioral analysis. We can do things that allow us to determine ROI for different actions and behavioral patterns." – Peter Crossley, Chief Architect
  • 90. Watch the Webtrends videos: https://youtu.be/hwpGj57VGz0 and https://www.youtube.com/watch?v=LifVwIwN61E
  • 91. The Data Journey for Cyber Security
  • 92. Symantec's Journey. Analyzing streaming threat data to increase velocity for time to protection. The Symantec™ Global Intelligence Network includes more than 57 million attack sensors across 157 countries. Data streams in from 75 million users on 120 million devices. Legacy platforms created 3-4 hour processing latencies when analyzing log files for digital threats, and attackers could exploit those processing time windows.
  • 93. Symantec's Journey: Data Science Speeds Time to Protection. Journey steps (renovate to innovate): Greenplum offload, security log analysis, device data ingest, metadata capture, threat archive, threat detection, attacker detection, threat predictions, unified security, digital security. Results: threat detection latency reduced from 4 hours to 2 seconds; time to protection improved 5000x; machine learning over tens of petabytes of historical data predicts threats to customers; the cloud team uses Ambari and Cloudbreak for dynamic clusters to meet peak workloads.
  • 94. The Data Journey to Secure Telco Networks
  • 95. Neustar's Journey. Neustar's telco network analytics business was limited by high data storage costs. Neustar offers its telecommunications customers network analytics services, but in 2011 faced a cost of $100,000 per terabyte of storage. It could only economically capture 10% of the data flowing through its networks, retained for 60 days. Neustar's CEO challenged her data warehousing group to retain 100% of the network data for at least one year.
  • 96. Neustar's Journey: Architecture Renovation Funded Service Innovation. Journey steps (renovate to innovate): network data storage, enriched app data, single view of the network, DDoS attack mitigation, rapid threat response, proactive network protection, new info services, secure telecom networks. Results: cost per terabyte reduced from $100K to under $250; 100% of data now retained, growing storage capacity 150X; data retention extended from 60 days to 2 years; elimination of existing support fees saved millions annually; new data assets help Neustar grow its product portfolio.
  • 97. The Data Journey to a Balanced Supply Chain
  • 98. Cardinal Health's Journey. Constrained analysis of the medical supply chain at Fuse by Cardinal Health. Cardinal Health supplies equipment and medicines to 85% of US hospitals and clinics. Limited visibility into the entire supply chain prevented suppliers from understanding how their drugs were prescribed, and acute pharmacists couldn't see all the product options they could prescribe for various conditions.
  • 99. Cardinal Health's Journey: Cardinal Health Launched a New Line of Business. Journey steps (renovate to innovate): sensor data ingest, public data ingest, prescription archive, drug supply chain analytics, drug cost optimization, single patient record, clinical decision support, pandemic response, outcome-based medicine, balanced medical supply chain. Results: Fuse by Cardinal Health aims to make healthcare safer and more cost-effective; the team enriches supply chain data with public sources, bringing suppliers, providers and patients closer together; data processing speeds doubled; Fuse shows suppliers how their drugs are used.
  • 100. Anonymous Case Studies
  • 101. AD1 – Advertising (manages online media programs for retail e-commerce websites). Creating opportunity. Data: clickstream & server log. Online ad placement analytics for a mega-retailer. Problem: the digital ad firm was unable to connect impression and click data. One of the world's largest retail websites made guesses about online ad placement based on Google Analytics. Clickstream data flowed in at hundreds of MB per hour and billions of rows per month, straining the existing architecture. There was no ability to connect ad impressions to clicks to purchases, nor to detect browsing device, geo-location, or whether the customer was in the store. Solution: a unified web tracking data repository provides a 360-degree view of online behavior. Impression files and click files are stored in the same data lake and easily joined for customer insight. With better targeting, fewer ads can be placed, improving the overall customer web experience. Social media data will be added for brand sentiment analysis.
  • 102. BK1 – Banking (one of the largest US banks). Creating opportunity. Data: structured, clickstream, social & unstructured. Monetize anonymous and aggregate banking data. Problem: valuable banking data needed to be anonymized and unified. The bank possesses data that indicates larger macro-economic trends, which can be monetized in secondary markets, but regulations and company policies protect customer privacy, data sets are isolated in legacy silos controlled by LOBs, and IT was challenged to join data while guaranteeing anonymity. Solution: a cross-bank data lake for aggregate data with secure access. Multiple data sets are abstracted from source platforms, with a single point of security and privacy for de-identification, masking, encryption, authentication and access control. Mortgage bankers, consumer bankers, the credit card group and treasury bankers have access to the same cross-sell data. Interoperability with partners SAS, R, Red Hat and Splunk; economies of scale for compressing and archiving data; significant reduction in storage costs from prior platforms.
  • 103. BM1 – Building Management (building efficiency and power solutions; >$420B in revenue; >140 employees). Improving efficiency. Data: sensor. Sensor data monitors buildings for efficiency. Problem: managing service calls on HVAC in commercial buildings. More than 70K systems in buildings around the US transmit data, but it is mostly kept on site or discarded. Servicing costs are high due to limited data on each unit, and data on work orders, sales orders and service orders is stored in different databases and not correlated. Solution: data consolidation and predictive analytics for efficiency. Raw data from HVAC sensors will land in HDP, along with work order, sales order and service call data. The system will predict component failures for product upsell (increased revenue) and service call efficiency (reduced costs), with management insight for a new service offering.
  • 104. EN1 – Energy (one of the world's largest producers of electricity; >$100B in revenue; >39 million customers; >150K employees). Improving efficiency. Data: sensor. Sensor data from smart electricity meters. Problem: the utility needs to match electricity supply with demand. Utilities cannot store power – it needs to be used. Some energy load is predictable, some is not. Overproduction requires cutting back and running below capacity, while underproduction risks starting less efficient "peaker plants". Smart meter data allows real-time analysis that can help effectively match energy production with consumption. Solution: predict demand spikes by analyzing real-time sensor data. Hive + Storm on YARN stream data into Hadoop; R + Mahout analyze aggregate consumption trends for predictive algorithms. More effective matching of energy production and consumption reduces energy costs and emissions.
  • 105. EN2 – Energy (major provider of upstream oil field services; >$29B in revenue; operations in 80 countries; >75K employees). Improving efficiency. Data: structured, sensor & server log. Proactive oil field decisions for pump equipment utilization. Problem: limited visibility into utilization of pump equipment. Oil field services cover exploration, drilling, well construction and production optimization, and the company manages a huge base of costly field equipment across 80 countries. Collecting and analyzing pump equipment data required time-consuming manual effort, and the standard data warehouse model and traditional reports did not scale well, yielding incomplete results. Solution: combine structured data, sensor data and log data for proactive equipment decisions. This reduces the manual time and effort to collect and analyze data from sensors above and below ground, as well as log data from pump trucks. The Big Data project runs in the Accenture Cloud, with Accenture providing data architecture, data science and project management services, integrated with embedded technologies from Hortonworks technology partners Microsoft, SAP and HP. Project goal: reduce equipment expense and improve margins.
Page 106 / 156
Powering Music Recommendations
Problem
CDH cluster failed, causing downtime
● Highly technical team was running a CDH cluster, without support
● CDH failed; CTO asked team to research support options
● Hive table stores data on all music streamed by users
● Data in Hive is mission-critical: used to recommend music & to pull monthly reports used to pay each music label
● Data expertise is their only sustainable competitive advantage
Solution
HDP powers music recommendation engine
● Stable recommendation engine and reconciliation reports
● Proactive technology partnership with their engineers, who are consumers of & contributors to Hadoop
● 2X per year, Hortonworks reviews the cluster for optimization
● Data was migrated from CDH to HDP, quickly and easily
Creating Opportunity
Data: Clickstream & Server Log
Entertainment
Online music streaming
>$500M in revenue
>24M users
ET1
Page 107 / 156
Donor and Voter Analytics for a Political Organization
Problem
Limited insight into donor behavior & voter mobilization
● Fundraising phone services lacked analysis on why donors give
● For campaign management, needed analysis on what factors cause constituents to register and vote
● Client knew they needed Hadoop for storage and analysis
● Needed education on roadmap, use cases and execution
Solution
Donor data store improves revenue from tele-fundraising
● Speed: rapid delivery of donor data store
● Deployment flexibility: runs in a Windows environment
● Targeted: phone reps talk to donors about the issues important to them
● Discovery: explore and enrich data from campaign operations
Creating Opportunity
Data: Unstructured
Fundraising
Political organization dedicated to tele-fundraising, voter contact and media services
>$1M in revenue
~100 employees
FR1
Page 108 / 156
Analysis of Gamer Data for Future Innovation
Problem
Social gaming platform needs more storage, more stability
● 4 million monthly gamers generate customer interaction data
● Existing CDH cluster was going down every month
● Desired tight integration with Datameer analytics tools
● Needed interactive query; Impala was not meeting that need
● Rapidly growing user base; need to manage the cluster as it scales
Solution
HDP for stability at scale, tight integration with Datameer
● Stable cluster that doesn't fall down like CDH did
● Easy data extracts from SQL Server
● Datameer analytics tools certified on HDP
● High-performing Hive queries
● Ambari for provisioning and maintenance as the cluster scales up
Creating Opportunity
Data: ETL
Gaming
Online strategy & role-playing games
~4M users
~$325M in revenue
~500 employees
GM1
Page 109 / 156
Gamer Migrates a Homegrown Cluster to HDP
Problem
Social gaming platform used Hadoop, but needed support
● Social gaming platform built its own Hadoop cluster
● Heavy users of Hive for analysis of player behavior
● Hadoop analysis informed strategies to prolong length of play, encourage purchase of virtual goods and respond to timed in-game events
● Heavy processing needs and ~1 petabyte of data outpaced the company's ability to support and extend its in-house cluster
Solution
HDP functionality + Hortonworks support = better games
● Easy migration from the native Hadoop cluster preserved data and processing tools
● HDP cluster includes a more complete ecosystem: Ambari, Flume, HBase, Hive, Oozie, Pig, Sqoop, Storm, ZooKeeper
● Social media sentiment analysis combined with data on player stats and behavior, used to improve games and their revenue
Creating Opportunity
Data: Clickstream, Server Log, Social & ETL
Gaming
Social gaming
~5M users
>$100M in revenue
~500 employees
GM2
Page 110 / 156
Clearing the Federal ETL Consulting Backlog
Problem
Federal consulting practice faces ETL backlog
● Sequestration budget cuts created demand to offload ETL from SAS
● Consulting practice faces a backlog of millions of dollars consulting on offload from SAS at 20 federal civilian agencies
● After offload, all data must still be easily accessible
Solution
Rationalized data storage saves taxpayer money
● Federal civilian agencies reduce ongoing data storage cost
● No loss of data or disruption to operations
● Base SAS and SAS/ACCESS are two out-of-the-box solutions for connectivity between SAS and Hadoop, via Hive
Improving Efficiency
Data: ETL
Government
Professional service provider consulting on federal projects
>$13B in revenue
>50K employees
GV1
Page 111 / 156
Processing Time-Sensitive Employment Reporting with Confidence
Problem
Agency reporting on labor data has 9 working days to prepare its report
● Agency reports on inflation, pay and benefits, unemployment levels and labor productivity
● Agency's monthly employment report moves financial markets
● State agencies report unemployment data to the federal office by the first Friday of the month
● Total data set is hundreds of millions of rows in 30 comma-separated files
● If the team finds errors in state data, it may take days to correct with the state affiliate
● Final report must be published by the third Friday of the month; time is precious
Solution
HDP speeds processing and improves confidence in unemployment findings
● Hortonworks partner OpenOsmium introduced Hortonworks to the client team
● Federal budget pressures created favorable policies towards open source software
● POC pilot: processing one of the thirty files on an HDP/Amazon Cloud solution
● Processing time reduced from 18 hours to less than 1 hour
● Absolutely no disruption to existing systems or operations
● Cloud cluster runs on an "as needed" basis, shut down remotely when not needed
Improving Efficiency
Data: Structured
Government
US federal government labor agency
GV2
Page 112 / 156
Sentiment Analysis for Government Programs
Problem
Ministry of Education felt removed from public sentiment on its programs
● In-person events lacked reach and persistence
● Ministry of Education wanted to understand sentiment from the citizenry on specific issues such as childhood obesity
● Two dedicated analysts pored over the social media stream and provided daily reports to a member of parliament
● IT team sought improvement over the limitations of manual analysis
Solution
Powerful "same day" sentiment analysis helps outreach (a scoring sketch follows this slide)
● Team produces daily memos on public sentiment, now with:
– Reach: includes opinions from a broader base of citizenry
– Confidence: more data, more confidence in opinion analysis
– Frequency: daily reads show policy-makers changes over time
– Precision: allows micro-analysis of specific issues and geographies
● Solution aligns to the government's support for open source
● Individual social media authors receive invitations to in-person meetings with government ministers
Creating Opportunity
Data: Social
Government
European national government
GV3
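The deck does not say which sentiment technique the ministry used. As a deliberately tiny, hypothetical illustration of lexicon-based scoring over a social stream (the word lists and messages are invented; production systems use much larger lexicons or trained models), consider:

```python
# Tiny sentiment lexicon -- purely illustrative word lists.
POSITIVE = {"good", "great", "love", "support", "helpful"}
NEGATIVE = {"bad", "angry", "fail", "against", "unhealthy"}

def score(message: str) -> int:
    """Positive minus negative word count; >0 leans positive, <0 negative."""
    words = message.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

stream = [
    "Love the new school lunch program, very helpful",
    "Another fail, kids are still eating unhealthy food",
]
for m in stream:
    print(score(m), m)
print("Net sentiment for the day:", sum(score(m) for m in stream))
```

Run daily over the full stream, even a crude score like this yields the trend lines behind the "daily reads show changes over time" bullet.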
Page 113 / 156
Sensor Data for Healthcare Supply Chain
Problem
Medical products have limited shelf life; tracking is essential
● Medical products delivered to pharmacies and hospitals
● Epidemics require agile changes to delivery schedules
● Materials are time-sensitive and climate-controlled
● Delivery logistics are complex & subject to risks outside of the company's control (product availability, weather, traffic, etc.)
● Slow delivery can harm supplies and medical outcomes
Solution
Sensor data protects supply chain, improves efficiency
● Sensor data from individual items and vehicles will give the company unprecedented supply chain visibility
● Analytic platform enables predictive algorithms for infrastructure planning, disease forecasting and supply chain forecasts
● Better tracking reduces waste, improves customer confidence and patient health
Improving Efficiency
Data: Sensor
Healthcare
Supplier of pharmaceuticals & medical products to pharmacies & hospitals
>$100B in revenue
>30K employees
HC1
Page 114 / 156
Predictive Analytics & Real-time Monitoring of Vital Signs
Problem
Unable to store sufficient data for decision support
● 22 years of data for 1.2 million patients, ~9 million records
● Data on the legacy system was neither searchable nor retrievable
● Cohort selection for research projects was slow
● For decision support, clinicians had minimal access to historical data gathered across all patients
Solution
Unified repository provides data to both researchers & clinicians
● "View only" legacy system retired, saving $500K
● 9 million historical records now searchable & retrievable
● Records stored with patient identification for clinical use; the same data presented anonymously to researchers for cohort selection
● Real-time monitoring: patches record vital signs every minute, algorithms notify clinicians if numbers cross risk thresholds (see the sketch after this slide)
● Readmit reduction: heart patients weigh themselves daily, algorithms notify doctors about unsafe weight changes
Improving Efficiency
Data: Sensor, Social & ETL
Healthcare
Public university teaching hospital
Consistently rated by US News & World Report as among America's best hospitals
>17K patient admissions
>400 physicians
~12K surgeries ('12)
HC2
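The slide describes threshold-based alerting on minute-by-minute vitals without showing the rule logic. A minimal sketch, assuming invented thresholds and field names (real clinical limits would be set by the care team), might look like:

```python
# Illustrative risk thresholds -- NOT clinical guidance; real limits are per patient.
THRESHOLDS = {
    "heart_rate":  (40, 130),   # beats per minute (low, high)
    "spo2":        (92, 100),   # blood oxygen saturation, percent
    "systolic_bp": (90, 180),   # mmHg
}

def check_vitals(patient_id: str, vitals: dict) -> list:
    """Return alert messages for any vital sign outside its allowed band."""
    alerts = []
    for name, value in vitals.items():
        low, high = THRESHOLDS[name]
        if not low <= value <= high:
            alerts.append(f"ALERT {patient_id}: {name}={value} outside [{low}, {high}]")
    return alerts

# One minute's reading from a (hypothetical) wearable patch.
for msg in check_vitals("P-0042", {"heart_rate": 144, "spo2": 95, "systolic_bp": 118}):
    print(msg)  # would be routed to the on-call clinician in a real pipeline
```

The daily-weight rule for heart patients works the same way, just with a delta check against the previous reading instead of a fixed band.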
Page 115 / 156
Affordable, Scalable Data for Healthcare Analytics
Problem
Relational database architecture limited data exploration
● Develops and maintains analytic applications for doctors
● Company couldn't access the volume or variety of data it wanted for those applications
● Analyzing huge data sets on relational databases was too slow
Solution
HDP reveals new big data insights, with cost savings & flexibility at scale
● Link and access new types of data that are currently outside of the healthcare domain, such as pharmacy receipts, text messages or patient web searches
● Per-node TCO of data on HDP was 25% that of the current relational DB
● Open-source Hadoop ecosystem gives multiple hardware and software integration options as the company scales its architecture
Creating Opportunity
Data: ETL
Healthcare
Analytics tools and decision support for the healthcare industry
~$130M in revenue
>2K employees
HC3
Page 116 / 156
Rapid Detection & Intervention for Stroke Prevention
Problem
Conditions that mimic strokes consume the short windows for critical intervention
● Some conditions show stroke-like symptoms (e.g. migraines or muscle spasms)
● Stroke neurologists spend 50% of their time with non-stroke patients
● Transient ischemic attacks ("TIAs") are mini-strokes that present like migraines, but are highly predictive of future full-blown strokes within the following days
● Incomplete or slow access to patient data hampers clinicians' ability to respond promptly to TIAs
Solution
HDP unifies present-day images with historical data to quickly identify TIAs
● Patient contact records (calls to the province's health hotline) merged with population historical records and present-day medical images improve diagnosis
● Algorithms on population risk factors (weight, age, cardiovascular problems) are mined for the probability that a given patient has similar risk factors
● With quantified risk factors, doctors quickly identify those at risk of imminent stroke
● Prescriptions of blood thinners, exercise and diet reduce the incidence of those strokes
Improving Efficiency
Data: Sensor, Unstructured & Structured
Healthcare
Top Canadian research university, researching epilepsy, stroke care and brain surgery outcomes in a government-run healthcare system
HC4
Page 117 / 156
Management of Chronic Health Conditions Such as Epilepsy
Problem
Epilepsy is a chronic, unpredictable & difficult-to-treat condition
● Epilepsy can go undiagnosed while seizures are minor
● Epileptics are at higher risk of depression, making the condition more difficult to manage
● Tabular data is gathered through treatment at epilepsy specialty clinics
● Additional tabular data in the system is difficult to combine for a complete picture
● Social data on patient behavior is unavailable for combination with tabular data on clinical history and pharmaceutical prescriptions
Solution
HDP healthcare data lake joins disparate data, for better disease management
● Data lake for a 360-degree view of the patient: electronic medical records, history of clinic visits, Facebook, Twitter & sentiment survey data
● Regular patient self-reporting with targeted surveys via mobile and web applications
● Dynamic calculation of changing sentiment scores, useful for proactive outreach
● Clinicians will have the ability to reference current psychographic & sentiment data immediately before (and during) the patient's scheduled clinical visits
Improving Efficiency
Data: Social & Structured
Healthcare
Top Canadian research university, researching epilepsy, stroke care and brain surgery outcomes in a government-run healthcare system
HC5
Page 118 / 156
Robotics & Real-Time Decision Support in Brain Surgery
Problem
Brain surgeons make real-time decisions using only a fraction of available data
● Brain surgeons may spend hours working (non-destructively) through brain tissue
● For aneurysms, they must clamp a weak point in a specific blood vessel
● Surgical assistant presents ~100 clamps, which the surgeon tries until finding a good fit
● Clamps exposed to the surgical environment are discarded, at a cost of $100K per surgery
● Time spent selecting and testing clamps can negatively affect surgical outcomes
Solution
Robots, streaming video & surgery inside an MRI with real-time decision support
● Researchers developed non-magnetic robots that surgeons control within an MRI
● Constant streaming of MRI imaging helps decisions while surgery is underway
● Recordings of MRI data stored in Hadoop, analyzed with machine-learning algorithms
● MRI images compared to surgical outcomes for insight into best practices
Improving Efficiency
Data: Sensor, Unstructured & Structured
Healthcare
Top Canadian research university, researching epilepsy, stroke care and brain surgery outcomes in a government-run healthcare system
HC6
Page 119 / 156
Data Science on Text-based Health Claim Records
Problem
Claims data in PDFs, hard to identify coding errors
● Produces applications for medical decision support
● Goal is marrying electronic health records with claims data
● 300K daily connections with individuals around unstructured data in PDFs (claims records and patient-reported outcomes)
● Data analysis is disjointed; difficult to identify patients and events that have been mis-coded or incompletely coded
Solution
Datasets unified in Hadoop to improve health outcomes
● Optical character recognition & natural language processing (a simple flagging sketch follows this slide)
● All of the unstructured, text-based data stored on HDP
● Coding errors will be identified much more efficiently
● Partially coded records can also be identified
● Coding efficiency will improve revenue
● Analysis of the underlying data will improve health outcomes
Improving Efficiency
Data: Unstructured
Insurance – Health
Large US medical insurer
>$100B in revenue
>100K employees
IH1
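The slide names OCR and NLP but not the downstream check. As a deliberately simplified, hypothetical sketch of flagging incompletely coded claims, the snippet below scans OCR'd claim text for ICD-10-style diagnosis codes with a regex and flags records whose billed procedures lack a supporting diagnosis (the pattern, fields and sample claims are all assumptions):

```python
import re

# ICD-10-CM codes look like "E11.9" or "I63.9": a letter (U is reserved),
# two digits, optionally a dot and more characters. Simplified for illustration.
ICD10 = re.compile(r"\b[A-TV-Z][0-9]{2}(?:\.[0-9A-Z]{1,4})?\b")

def flag_claim(claim_id, ocr_text, billed_procedures):
    """Flag claims where billed work is not backed by any diagnosis code."""
    diagnoses = ICD10.findall(ocr_text)
    if billed_procedures > 0 and not diagnoses:
        return f"{claim_id}: {billed_procedures} procedures billed, no diagnosis code found"
    return None

claims = [
    ("CL-1", "Patient presents with type 2 diabetes E11.9, metformin continued.", 2),
    ("CL-2", "Follow-up visit, patient doing well.", 1),  # incompletely coded
]
for claim_id, text, procs in claims:
    issue = flag_claim(claim_id, text, procs)
    if issue:
        print("REVIEW:", issue)
```

The real pipeline would apply richer NLP than a regex, but the pattern-then-flag structure is the same: extract candidate codes, then route codeless or inconsistent records for human review.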
Page 120 / 156
Insurance Data Lake to Manage Risk
Problem
Challenges merging new & old data hamper analysis
● Traditional and newer types of data were both growing quickly but were difficult to combine in the EDW
● "Schema on load" requirements of the EDW platform limited ingest of some data with significant predictive power
● Company missed data-driven ways to serve customers
● Process of separating legitimate from fraudulent claims created a "needle-in-a-haystack" problem
Solution
Common platform for all types of data improves up-sell and reduces fraud
● "Schema on read" Hadoop architecture means that more data sources can be easily ingested to enrich predictive analytics (see the sketch after this slide)
● Agents use big data insights to determine the best action for valued customers and recommend those in real-time
● Claims analysts and underwriters process streaming data to quickly flag fraud risks and fast-track legitimate claims
Creating Opportunity
Data: Structured, Clickstream, Server Log
Insurance – Health
Large US medical insurer
>$30B in revenue
>20M members
~35K employees
IH2
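"Schema on load" versus "schema on read" is the technical crux of this slide: the warehouse demands a fixed schema before data lands, while Hadoop stores raw files and applies structure at query time. A minimal pure-Python sketch of the schema-on-read idea (the log format and field names are invented):

```python
import json

# Raw lines land as-is -- no schema was required at ingest time.
raw_lines = [
    '{"member": "M1", "event": "claim", "amount": 1200}',
    '{"member": "M2", "event": "login", "device": "mobile"}',  # different shape, still stored
]

def read_with_schema(lines, wanted_fields):
    """Apply structure at read time: project whatever fields this analysis needs."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in wanted_fields}  # missing fields become None

# Two different "schemas" over the same raw data, chosen per analysis.
claims_view = list(read_with_schema(raw_lines, ["member", "amount"]))
device_view = list(read_with_schema(raw_lines, ["member", "device"]))
print(claims_view)
print(device_view)
```

Because no shape is enforced at load time, a new data source with "significant predictive power" can be ingested immediately and modeled later, which is exactly the flexibility the EDW lacked.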
Page 121 / 156
Speeding Analysis for Usage-Based Car Insurance
Problem
Risk analysis lagged because of architecture gaps
● Business insight from data analysis was too slow
● Growing volume, velocity and variety of incoming data taxed existing systems & processes
● ETL process across disparate systems only captured 25% of the dataset and took 5-7 days to complete
Solution
Speed time-to-insight with clickstream analytics & faster ETL
● Clickstream analytics
– Moving from a hosted Azure platform to HDP on site will improve performance and analytical functions (with Apache Hive)
● ETL acceleration
– Process 100% of the data, in three days or less
Creating Opportunity
Data: Clickstream & ETL
Insurance – Property & Casualty
Personal auto & other property-casualty insurance
>$17B in revenue
~28K employees
IP1
Page 122 / 156
Data Lake for P&C Insurance Claim Analysis
Problem
Structured data scaled; unstructured data analysis did not
● Large P&C insurance provider had systems for analyzing structured data at scale
● Unstructured data from claims notes and social media had the potential to add valuable information to claims analysis
● Structured data analysis scaled, but joining that information with hand-written notes or social media data did not scale
● Limited data visibility hampered underwriting and claims
Solution
Merge structured & unstructured data for better decisions
● "Schema on read" Hadoop architecture means that more data sources can be easily ingested (text and social media)
● Previously disparate data sets are joined for greater insight
● Larger data sets fed to front-end business tools provided by Hortonworks partners: SAS, Tableau and QlikView
Improving Efficiency
Data: Structured, Social & Unstructured
Insurance – Property & Casualty
Major provider of property casualty, life and mortgage insurance
>$65B in revenue
>60K employees
Operations in >100 countries
IP2
Page 123 / 156
Maintaining SLAs for Equity Trading Information
Problem
Meeting 12-millisecond SLAs for the "ticker plant"
● Daily ingest: 50GB of server log data from 10,000 feeds
● Four times daily, this data is pushed into DB2
● Applications query this data 35K times per second
● 70% of queries are for data <1 year old, 30% for data >1 year old
● Current architecture can only hold 10 years of trading data
● Growing volume puts performance at risk of missing SLAs
Solution
Meeting SLAs with confidence
● HBase provides super-fast queries within SLA targets (see the row-key sketch after this slide)
● ETL offloading to Hadoop allows longer data retention, without jeopardizing fast response times
Improving Efficiency
Data: Server Log & ETL
Investment Services
Highly trafficked website providing business and financial information
~15K employees
IS1
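Millisecond HBase reads generally come down to row-key design: keys sorted as symbol + timestamp make "recent ticks for one symbol" a short contiguous scan rather than a table-wide search. A hypothetical sketch using the happybase Python client; the host, table name, column family and key layout are all assumptions, not details from the case study:

```python
import happybase  # Thrift-based Python client for HBase

conn = happybase.Connection("hbase-thrift-host")  # hypothetical host
ticks = conn.table("ticks")  # assumed table with column family "q"

def row_key(symbol: str, ts_millis: int) -> bytes:
    """Keys sort by symbol, then time, so one symbol's ticks are contiguous."""
    return f"{symbol}#{ts_millis:013d}".encode()

# Write one tick (quote fields are illustrative).
ticks.put(row_key("ACME", 1700000000123), {b"q:price": b"101.25", b"q:size": b"300"})

# Point reads and short range scans over a contiguous key range stay fast at
# high query rates -- this is the property a 12 ms SLA relies on.
start, stop = row_key("ACME", 1700000000000), row_key("ACME", 1700000060000)
for key, data in ticks.scan(row_start=start, row_stop=stop):
    print(key, data[b"q:price"])
```

The zero-padded timestamp keeps lexicographic order equal to chronological order, which is what makes the range scan work.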
Page 124 / 156
Banking Data Lake for 100s of Use Cases
Problem
Architecture unsuited to capitalize on server log data
● Huge investments company generates valuable data assets which are largely unavailable across the organization
● Current EDW solutions are appropriate for some data workloads but too expensive for others
● Financial log data is difficult to aggregate & analyze at scale
● Short retention hampers price history & performance analysis
● Limited visibility into the cost of acquiring customers
Solution
Multi-tenant Hadoop cluster to merge data across groups
● Server log data will be merged with structured data to uncover trends across assets, traders and customers
● ETL offload will save money for Hadoop-appropriate workloads
● Longer data retention enables price history analysis
● Joining data sets for insight into customer acquisition costs
● Accumulo enforces read permissions on individual data cells
Creating Opportunity
Data: Server Log
Investment Services
Global investments company
>$1.5 trillion assets under management
>$14B in revenue
~50K employees
IS2
Page 125 / 156
Anti-Laundering & Trade Surveillance for Investment Firm
Problem
Lags in back office system limit intraday risk analysis
● 15M transactions and 300K trades every day
● Storage limitations required archiving, limiting data availability
● Trading data not available for risk analysis until end of day, which hampers intraday risk analytics and creates a time window of unacceptable exposure
Solution
Data lake accelerates time-to-analytics & extends retention
● Shared data repository combines more comprehensive data sets about all firm activities, improving data transparency
● Operational data available to risk analysts earlier, same day
● Trading risk group will process more position, execution and balance data and hold that data for five years
● Hadoop enables ingest of data from recent acquisitions despite disparate data definitions and infrastructures
Creating Opportunity
Data: Structured
Investment Services
Trading services for millions of client accounts
>$16B in assets
>4,000 advisors
IS3
Page 126 / 156
Customer Insight from Consumer Electronics Product Usage Data
Problem
Lacked central repository for efficient data storage & analysis
● Rivers of data flow from millions of consumer electronic products
● Company lacked a platform to capture new types of data: geolocation, clickstream, server log, sensor & unstructured
● Unable to exploit key competitive advantage: unique customer insight from troves of big data
Solution
Efficient data storage unlocks value in company data
● Hadoop data lake permits a view into how customers use products, across multiple types of data
● Lower cost of storage improves the margin for retaining data
● Powerful cluster includes many key ecosystem projects: Hive, HBase, HCatalog, Pig, Flume, Sqoop, Ambari, Oozie, Knox, Falcon, Tez and YARN
Creating Opportunity
Data: Geolocation, Clickstream, Server Log, Sensor & Unstructured
Manufacturing
Consumer electronics
>$180B in revenue
>400K employees
MF1
Page 127 / 156
Optimizing High-Tech Manufacturing
Problem
Data scarcity for root cause analysis on product defects
● 200 million digital storage devices manufactured yearly
● Devices not passing QA are scrapped at the end of the line
● >10K faulty devices returned by customers every month
● Limited data available for root cause analysis means that diagnosing problems is highly manual (physical inspections)
● Subset of sensor data from QA testing retained 3-12 months
Solution
Data retention doubled, with 10x processing improvement
● Repository of sensor data now holds a larger portion of total data
● Dashboard created 10x more quickly than before Hadoop
● Data retained for at least 24 months
● Manufacturing dashboard allows >1,000 employees to search data, with results returned in less than 1 second
Improving Efficiency
Data: Sensor
Manufacturing
Digital Storage Devices
>$15B in revenue
>85K employees
MF2
Page 128 / 156
Social Site Speeds Processing, Reduces Cost
Problem
Data growth outpaced existing Greenplum solution
● 20M monthly unique visitors, and growing
● Greenplum storage solution was slow and expensive
● Operations team challenged by data growth
● Analytics team hampered by slow processing speed
Solution
Processing speeds doubled, storage cost decreased
● Operations team saw processing speeds 2x those of Greenplum
● Significant cost savings from moving data to HDP
● During this second year of the support relationship, plans to move more workloads to HDP, for better insights at a lower cost
Creating Opportunity
Data: Clickstream & Server Log
Online Community
Online social network
>$50M in revenue
>300M members
2nd year with Hortonworks
OC1
Page 129 / 156
Powering Professional Network Recommendations
Problem
Lack of a recommendation engine to promote connections
● >13M non-English-speaking members find jobs & connections
● User interactions generate semi-structured data
● Clickstream, server log and social data could feed recommendations
● Company lacked a stable platform to store, refine & enrich that raw data
Solution
Hadoop recommendation engine to compete with LinkedIn
● Replaced existing CDH cluster
● New types of data feed a superior recommendation engine that enhances the value of belonging to the community
● YARN, Tez and the Stinger initiative provide near-term functionality and long-term confidence
Creating Opportunity
Data: Clickstream, Server Log & Social
Online Community
Online professional network
>$90M in revenue
>13M members
OC2
Page 130 / 156
Better Romantic Matches with Data Science
Problem
Newer types of data unavailable for matchmaking algorithms
● Unable to store clickstream data and user-entered content
● Other types of data only retained for seven days
● Recommendations would help users craft attractive profiles
● High costs to store an ever-growing amount of data
● Relational data platform did not fulfill their requirements
Solution
Hadoop cluster for A/B testing, device analysis, text mining
● A/B testing: consolidate email & clickstream from SQL databases (see the sketch after this slide)
● Usage patterns across devices, browsers and applications; understand who uses their mobile app
● Mine user-created text (profile language and user-to-user communications) for the recommendation engine
● Longer data retention: find subtle trends with a longer time window
Creating Opportunity
Data: Server Log & ETL
Online Community
Online dating site
>300 employees
OC3
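The deck mentions A/B testing without detail. As a minimal, self-contained illustration of the statistics behind it (the conversion counts are invented), the snippet below runs a two-proportion z-test; a |z| above roughly 1.96 suggests the difference between variants is unlikely to be chance at the 5% level:

```python
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Invented example: variant A (new email subject line) vs. B (control).
z = two_proportion_z(conv_a=620, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z = {z:.2f}")
if abs(z) > 1.96:  # two-sided test at the 5% significance level
    print("Variants differ significantly; ship the winner.")
```

Consolidating email and clickstream data in one place is what makes tests like this routine: both the exposure (who saw which variant) and the outcome (who converted) live in the same store.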
Page 131 / 156
360° View of Customer for Call Center Sales
Problem
Call center sales reps unable to recommend the best product
● 2000+ product lines
● Multiple customer interaction channels (web, Salesforce, face-to-face, phone)
● Poor visibility causes sales reps to miss opportunities, and customer satisfaction suffers
Solution
Improve sales conversions with optimal product recommendations
● Call center reps will understand every interaction with the customer, to improve service calls
● Natural language analysis of rep emails to customers identifies the best response language and coaching opportunities
● Recommendation engine predicts the next best product for each customer
Creating Opportunity
Data: Unstructured
Retail
IT solution and equipment reseller
>$10B in revenue
>6K employees
RT1
Page 132 / 156
360° Customer View for Home Supply Retailer
Problem
Lack of a unified customer record across all channels
● Global distribution online, in home and across 2000+ stores
● Unable to create a "golden record" for analytics on customer buying behavior across all channels
● Data repositories on website traffic, POS transactions and in-home services existed in isolation of each other
● Limited ability for targeted marketing to specific segments
● Data storage costs increasing
Solution
HDP delivers targeted marketing & data storage savings
● Golden record enables targeted marketing capabilities: customized coupons, promotions and emails
● Data warehouse offload saved millions in recurring expense
● Customer team continues to find unexpected, unplanned uses for its 360-degree view of customer buying behavior
Creating Opportunity
Data: Clickstream, Unstructured, Structured
Retail
Major home improvement retailer
>$74B in revenue
>300K employees
>2,200 stores
RT2
Page 133 / 156
Using In-Store Location Data to Improve Cross-Sell
Problem
Retailer lacks data on how customers move through stores
● Placement of product within department stores affects sales
● Sales data is not specific enough to suggest specific changes
● Online retailers can compare what shoppers view with what they buy, but brick-and-mortar stores lack this insight
● Result: critical decisions about store layout and inventory are made without data on shopper movement
Solution
Micro-data on shopper location enables in-store analysis similar to website analysis: locations visited vs. purchases
● Apple iBeacon app captures in-store location data for shoppers who have the app on their iPhones
● Data streams into HDFS on how customers move through the stores, relative to the location of particular products (a dwell-time sketch follows this slide)
● Enables real-time promotions to customers with smart phones, based on who they are and where they stand in the store
● Historical data across all shoppers provides insight on store design
Creating Opportunity
Data: Sensor & Geolocation
Retail
Major omni-channel retailer
>$27B in revenue
>175K employees
>800 stores
RT3
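The slide describes streaming beacon pings but not how they become layout insight. A minimal, hypothetical sketch below turns one shopper's sequence of (timestamp, beacon zone) pings into per-zone dwell times, the raw material for "locations visited vs. purchases" analysis; the data model is invented:

```python
from collections import defaultdict

# (seconds since store entry, beacon zone) pings for one shopper -- invented data.
pings = [
    (0, "entrance"), (30, "electronics"), (60, "electronics"),
    (90, "toys"), (120, "toys"), (150, "checkout"),
]

def dwell_times(pings):
    """Sum time spent in each zone, attributing each interval to its starting zone."""
    totals = defaultdict(int)
    for (t0, zone), (t1, _) in zip(pings, pings[1:]):
        totals[zone] += t1 - t0
    return dict(totals)

print(dwell_times(pings))
# {'entrance': 30, 'electronics': 60, 'toys': 60}
# Joined with purchase data, long dwell + no purchase flags a layout problem.
```

Aggregated across all shoppers and all stores, this is the brick-and-mortar analogue of comparing page views with checkouts on a website.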
Page 134 / 156
Unified Data for Online Recommendation Engine
Problem
5 data sets are fragmented, hampering product recommendations
● 5 major data sets: inventory data, transactional data, user behavior data, customer profiles & log data
● Unified view needed, to recommend items to users
● Currently lacks an analytics dashboard across all types of data
● Storing non-transactional data on the EDW is expensive
Solution
Unified data lake for increased sales and lower costs
● Unified 360° view for recommendations of similar products
● Analytics dashboard joins clickstream with transactional data
● Summary data stored in HBase, can be queried with web apps
● Offload some data from the Teradata EDW, to lower storage costs
● Actively partnering with engineers to improve Hadoop
Creating Opportunity
Data: Structured, Clickstream, Server Log & Unstructured
Retail
eCommerce marketplace
>$12B in revenue
>30K employees
RT4
Page 135 / 156
Predicting Car Prices With High Confidence
Problem
Achieving 99.1% confidence in car price estimates
● Goal: provide consumers & dealers reliable car price guides
● Promise: 99.1% confidence that the projected price paid will be within $20 of the average national price paid in a given week
● As the network of dealers grew, the existing SQL Server data warehouse was expensive and difficult to scale
Solution
Cost savings & data reliability at scale in a data lake
● Mission-critical price data moved to Hadoop architecture
● Server log data flows into HDP with Flume
● Analysis of this data allows analysts to further improve the accuracy of estimates
Creating Opportunity
Data: Server Log & ETL
Retail
Online eCommerce service for buying and selling cars
~300 employees
RT5
Page 136 / 156
Recommendation Engine Improves Department Store Sales
Problem
Need to create better product recommendations
● Multiple touch points: store, kiosk, web and mobile app
● Wants to deliver customized promotions, coupons & recommendations
● Data was not integrated, making a 360° view of customer behaviors impossible
Solution
Recommendations to all channels, based on a data lake
● Ingest all raw data from different product lines into HDP
– Real-time data ingestion
– Structured data ingestion
● Transform raw data
– ETL processing with Pig and Hive
– Use Mahout and R to make recommendations (a co-occurrence sketch follows this slide)
● Recommendations will be fed to all channels
– HBase serves recommendations to the web site, kiosk and mobile app
Improving Efficiency
Data: ETL
Retail
Specialty department store
>$19B in revenue
>130K employees
RT6
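Mahout's classic item-based recommenders rest on co-occurrence: items frequently bought together are good candidates to recommend. A toy pure-Python version of that idea (the baskets are invented; Mahout computes this at cluster scale over the POS data landed in HDP):

```python
from collections import Counter, defaultdict
from itertools import combinations

# Invented purchase baskets standing in for POS transaction data.
baskets = [
    {"dress", "shoes", "belt"},
    {"dress", "shoes"},
    {"shoes", "socks"},
    {"dress", "belt"},
]

# Count how often each pair of items appears in the same basket.
co_counts = defaultdict(Counter)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

def recommend(item, k=2):
    """Top-k items most often co-purchased with `item`."""
    return [other for other, _ in co_counts[item].most_common(k)]

print(recommend("dress"))  # e.g. ['belt', 'shoes']
```

In the architecture on the slide, the batch job producing these pairs runs in Hadoop, and the resulting top-k lists are written to HBase so the web site, kiosk and mobile app can fetch them with low-latency lookups.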
Page 137 / 156
Faster Reports for Real Estate Agents
Problem
Accelerate reports on movers for real estate agents
● 20 million monthly visitors to family of websites
● Reports on movers not consistently generated quickly enough
● Pressure from newer market entrants
● High data storage costs reduce margins on data
Solution
More data for faster reports at a lower cost
● Improved analytical efficiency speeds report turnaround
● Data storage costs lower than before
● Improved visibility into macro trends in real estate
● Refine, explore and enrich the data better than competitors
Improving Efficiency
Data: Clickstream & ETL
Software
Operator of real estate websites
~$200M in revenue
>1,000 employees
SW1
Page 138 / 156
Unified View Across Products, for Product Managers
Problem
Data fragmentation across products and verticals
● More than 20 product lines
● Multiple verticals: retail, financial services, healthcare, manufacturing, communications, utilities & government
● Each product line has a separate data repository
● Unified analysis across product lines was impossible
Solution
Data consolidation for cross-product customer analysis
● Product managers will have unified data for analysis
● Raw data from different products will land in HDP
● Data will then be refined and transformed
● Real-time data ingestion with Flume
● Batch data movement with Sqoop
● ETL processing with Pig and Hive
Creating Opportunity
Data: ETL
Software
Data security software, cloud computing
~$130M in revenue
~1,100 employees
SW2
Page 139 / 156
Data Lake Protects Customers' Enterprise Data Security
Problem
Batch processing created risk exposure; redundant systems drove costs up
● Customer protects the world's largest organizations from data security breaches and backs up their mission-critical data
● Processes client data to identify threats and vulnerabilities
● Multiple acquisitions led to a redundant patchwork of big data analysis solutions, including Greenplum, Netezza and Vertica
● Six LOBs needed a common, multi-tenant data repository
● Existing batch processing caused a 15-minute latency window, with exposure risk
Solution
HDP data lake consolidates infrastructure, reduces cost & speeds response times
● Consolidation into one HDP data lake represents savings of tens of millions of dollars
● Multi-tenancy with YARN permits secure access for multiple LOBs
● Real-time analysis with Apache Storm and interactive query with Apache Hive close the 15-minute risk window from the earlier architecture
● Data lake also used for marketing: clickstream analysis & 360-degree customer view
Improving Efficiency
Data: Server Log, Clickstream & ETL
Software
Global leader in data security, storage and system management software
>$6B in revenue
>18K employees
SW3
Page 140 / 156
Launching New Data Analysis Products
Problem
Enterprise customers have no visibility into performance
● Platforms connect 3.4 billion transactions per year
● Currently storing 90TB, growing at 20% YoY
● All divisions retain data for 36 months, except the healthcare network: 7 years
● Customers have no visibility into their companies' activity on the commerce platforms
● Client wants to add analytics services to cross-sell to existing customers and attract new customers
Solution
HDP data lake enables launch of new information products
● Shorten data processing workloads from days to hours
● Enable ad hoc analytics queries
● Create data analysis products and services for customers of promotion, supply chain and healthcare networks
● New product: anonymous reports that benchmark a customer against competitors in the same industry
Creating Opportunity
Data: ETL
Software
Operator of intelligent ecommerce networks
>1,400 customers
~5K employees
SW4
Page 141 / 156
Product Managers Speed Product Innovation with Hadoop
Problem
Product managers needed to analyze server logs
● 130K clients drive 780M transactions per day
● Services incorporate streams from core CRM and 3rd-party platforms like Twitter, Facebook and YouTube
● Product managers need to capture and interpret server log data to analyze new feature adoption & performance (see the sketch after this slide)
● Unable to process the current volume using relational data stores
● Unable to retain enough data because of cost
Solution
HDP gives PMs power, reliability and liberty
● Power: analysis of more than 30TB per month
● Reliability: previous system broke every 2 weeks; no longer
● Liberty: open source solution prevents vendor lock-in
● HDP increases Product Management storage and analysis without a corresponding increase in IT spend
Creating Opportunity
Data: Server Log
Software
Sales & CRM software, cloud computing
~$3B in revenue
~10K employees
SW5
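As a minimal, hypothetical illustration of the kind of server-log analysis the product managers wanted (the log format and feature names are invented), the sketch below parses access-log lines and counts distinct adopters per feature:

```python
import re
from collections import defaultdict

# Invented log format: "<timestamp> user=<id> action=<feature>"
LOG_LINE = re.compile(r"user=(?P<user>\S+) action=(?P<feature>\S+)")

logs = [
    "2014-03-01T10:00:00 user=u1 action=bulk_export",
    "2014-03-01T10:05:00 user=u2 action=bulk_export",
    "2014-03-01T10:07:00 user=u1 action=inline_edit",
]

adopters = defaultdict(set)
for line in logs:
    m = LOG_LINE.search(line)
    if m:  # skip malformed lines instead of failing the whole job
        adopters[m.group("feature")].add(m.group("user"))

# Distinct users per feature -- a basic adoption metric after a release.
for feature, users in sorted(adopters.items()):
    print(f"{feature}: {len(users)} distinct users")
```

At 30TB of logs per month the same projection runs as a distributed job rather than a loop, but the parse-filter-aggregate shape is identical.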
Page 142 / 156
eCommerce Platform Uses Data Lake for Insight
Problem
New types of data difficult to store, unavailable for analysis
● Millions of payments processed every day
● Fraudsters sell fake items or extract buyer account info
● Some creditors default, resulting in losses
● Unable to store the current volume using relational data stores
● Unable to retain older data because of RDBMS storage cost
Solution
HDP data lake accelerates multiple analysis projects
● Platform stores all new types of data: clickstream, social, sensor, geolocation, server logs and unstructured data
● Detects and prevents theft: fraudsters stealing from members
● Assesses credit risk: server log analysis & machine learning
● Manages offers: aggregates data for advertisers
● User experience: social sentiment analysis on usability
● Site optimization: analyze clickstream for site improvements
Creating Opportunity
Data: Server Log
Software
eCommerce payments platform
~$6B in revenue
>130M users
~13K employees
SW6
Page 143 / 156
Offloading Clickstream Data from Netezza
Problem
EDW capacity consumed by exhaust data, clickstream unavailable
● Netezza EDW operating near capacity
● Netezza housing exhaust data not required for intended reporting and analytics, leading to unnecessary expense
● Enterprise IT maintained redundant data stores
● Unable to store clickstream data to enrich consumer intelligence
Solution
Longer storage, lower cost & better consumer intelligence
● Hadoop will recover premium Teradata cycles, currently used for transformations and data movement
● Projected cost savings of >$1M by offloading exhaust data
● Analysis of clickstream adds a new dimension to the customer view
● Improved service efficiency: bill processing & reporting
Improving Efficiency
Data: ETL & Clickstream
Telecom
Major telecom provider
~$25B in revenue
>40M customers
TC1
Page 144 / 156
Unified Household View of the Customer
Problem
Acquisitions & data explosion fragment view of customer
● Recent acquisitions and proliferation of types of data caused a fragmented view of customers
● Data exists across multiple applications & data stores
● Semi-structured data: social, sensors & networked devices
● Difficult to integrate structured, semi-structured & unstructured data sets from so many distinct sources
Solution
HDP data lake delivers 360° unified household view
● Stable environment for exploring and enriching the data
● Store all of the data and retain it for longer
● Parse on demand: no need to pre-parse data before loading
● Analysis on demand: allows analysts to explore raw data and find unexpected truths in the data
Creating Opportunity
Data: ETL, Social, Sensor & Clickstream
Telecom
Major telecom provider, offering data networks & services
>$100B in revenue
>200K employees
TC2
Page 145 / 156
Call Record Analysis for Improved Cell Service
Problem
System receives millions of call detail records per second
● System enables proactive management of phone call quality
● Call detail records (CDRs) are the raw data used for analysis
● Millions of CDRs stream in every second
● Storage is expensive & ingest rates are increasing 20% YoY
● 24-hour data retention not sufficient to discover long-term trends
Solution
Longer storage & rich analysis improve customer service
● HDP's 10:1 compression allows affordable 6-month retention (see the capacity arithmetic after this slide)
● Improved forensics on instances of poor call quality drive:
– Informed decisions on expansion of transmission infrastructure
– Predictive analytics on when to repair/replace equipment
● Access to more data helps service reps solve customer issues in near real-time
Creating Opportunity
Data: Sensor
Telecom
Major telecom provider, offering data networks & services
>$100B in revenue
>200K employees
TC3
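The jump from 24 hours to 6 months of retention is easier to see with the arithmetic written out. The ingest figure below is an assumption chosen only to make the effect of the 10:1 ratio concrete; the case study does not state a daily volume:

```python
# Assumed raw CDR ingest rate -- illustrative, not from the case study.
raw_tb_per_day = 10.0
compression = 10  # the 10:1 ratio cited on the slide

def storage_tb(days: int, ratio: int = 1) -> float:
    """Footprint for `days` of retention at the given compression ratio."""
    return raw_tb_per_day * days / ratio

before = storage_tb(1)                 # 24-hour retention, uncompressed
after = storage_tb(180, compression)   # ~6 months at 10:1 compression
print(f"before: {before:.0f} TB for 1 day")
print(f"after:  {after:.0f} TB for 180 days")
# 10 TB/day * 180 days / 10 = 180 TB: 180x the history for 18x the footprint.
```

That 10x gap between history gained and footprint paid is what makes the longer-term trend analysis on the slide affordable.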
Page 146 / 156
ETL: 100x the Data, 12x Longer, $3M Saved
Problem
Changing business model required new data architecture
● Started in the 1990s as a neutral intermediary for telco networks
● Network management market is mature
● CEO challenged the company to build a business for data analysis and information services related to telecom data
● Netezza data capacity limited to 20TB
● Only stored 1% of the total dataset, retained for only 60 days
Solution
More data, stored longer, with $3 million in cost savings
● Avoided $3M in annual expense, compared to Netezza
● Now storing 100% of data (100x the prior 1%), retained for two years (roughly 12x the prior 60 days)
● Larger data set supports new, accurate information products
● Improved access to data for more employees drives new innovation across the enterprise
Creating Opportunity
Data: ETL
Telecom
Telco information and analytics vendor
$800M in revenue
~2,500 employees
TC4
Page 147 / 156
Searchable ETL for CDRs & Customer Data
Problem
Data storage costs limit the amount and types of data available for analysis
● Teradata and Vertica used for data storage: ideal for certain data workloads, but unsuited for less structured types of data
● Limited retention of call detail records (CDRs)
● Limited analysis across call logs, CRM records & customer acquisition models
Solution
Data lake: ETL, data exploration & NPTB ("next product to buy") recommendations
● Partners Teradata, HP and Impetus helped craft a solution
● CDRs now retained for longer, improving visibility & analysis
● Customer retention data can be correlated to service quality
● Plan to integrate search for real-time NPTB recommendations
● Improved customer acquisition and retention
Creating Opportunity
Data: Structured, Server Log & Geolocation
Telecom
Telco vendor specializing in VoIP
>$800M in revenue
>2M subscribers
~1,000 employees
TC5
Page 148 / 156
Better Service to Premium Customers, for Less
Problem
Inability to identify base stations serving premier customers
● CRM system and network logs were in isolated data silos
● Company unable to analyze base station usage by premium customers, to prioritize investments
● Info gap prevented optimal ROI on infrastructure investments
Solution
HDP joins structured CRM & unstructured network data at scale
● Partnered with Datameer and HP to deliver a unified solution
● Joins network data on utilization of base stations with CRM data on the value of customers using those stations most often
● Optimizes service to the most valuable customers
● Efficient resource allocation reduces the overall cost to maintain network infrastructure
Improving Efficiency
Data: Structured & Server Log
Telecom
Major European telco
>$800M in revenue
>300M customers
>100K employees
TC6