Paving the Way to
"Data Driven”
Mohd izhar firdaus ismail
solution architect
abyres enterprise technologies sdn bhd
www.abyres.net
(c) 2017 Abyres Enterprise Technologies Sdn Bhd
About Me
● About Me
– Mohd Izhar Firdaus Bin Ismail
– Solution Architect & Head of Data Engineering Department, ABYRES Enterprise Technologies Sdn Bhd
● About ABYRES
– System Integrator company focusing on consulting and implementation of state-of-the-art solutions around Open Source IT infrastructure and data center modernization
● Data Engineering & Big Data
● IT Modernization
● Enterprise Mobility Platform
● Unix to Linux Migration
Outline
● Demystifying Big Data
● History of Big Data
● Impact of Big Data
● Evolution of Data Management
● Big Data Architectures
● Data Collection
● Internet of Things
● Tools & Technologies
● Open Source License
● Framework for Data Journey
● Hortonworks Case Studies
Demystifying Big Data
Back to Basics
Input Process Output
Storage
Traditional Computing: "Small" Data
Input Process Output
Storage
Low to medium rates of incoming data can easily be collected using software/applications that run single-core or multi-threaded.
Processing low to medium amounts of data can easily be done using simple architectures, processing data in single/multi-core environments. Managing storage of the data is also simple, using a single disk or an array of disks merged together in RAID, in a single machine.
Outputs show simple results that can easily be viewed using client software that can even load the whole dataset while still giving good performance.
Big Data Computing: Massive Data!
Input Process Output
Storage
High volume and velocity of incoming data call for a totally different breed of data collection and pipelining software that can run in a distributed environment, across thousands of cores on hundreds of computers.
High volumes of data with complex processing needs, especially complex relationships and complex unstructured data, require high-throughput distributed computing to process the data and get results in time for the business to use. RAID arrays are no longer enough to store the high volume of data, calling for distributed storage that can easily scale to cater for additional data arriving at high velocity.
Analyzing and visualizing massive amounts of data to make sense of their complex relationships cannot easily be done through basic charts, calling for new visualization techniques, and strategies to minimize client-side processing of visualizations in order to get good rendering performance.
Criteria of Big Data
● Volume
– Data coming from various sources and increased regulation in multiple areas mean storing more data for longer periods of time
– Gigabytes, Terabytes, Petabytes ... Zettabytes
● Velocity
– Machine data, as well as data coming from new sources, is being ingested at speeds not even imagined a few years ago
– 1MB/s, 10MB/s, 50MB/s growth rates and beyond
● Variety
– Unstructured and semi-structured data is becoming as strategic as structured data
– Video, audio, images, free text
Uncaptured & UnanalyzedData – AMissed Opportunity
All organizations have data lying around, either not yet
captured, poorly captured, or captured but not analyzed.
The data may contain hidden gems for improving decision
making, leaving them alone is a missed opportunity
Big Data Processing Technologies are NOT Big Data
● It is a common misconception that if one is adopting Hadoop, Spark, etc., one is adopting Big Data. This is not true.
● Big Data is the massive amount of data you have either collected, or have the opportunity to collect, but are unable to collect and process due to computing or cost limitations.
● Adopting Hadoop, Spark, or any other Big Data technology without a strong data collection and analysis strategy will not give you the benefits that you might want.
Hadoop != Big Data
How Big Is Big?
● A common question: how big should my data be for it to be considered Big Data?
● The answer is fairly subjective, depending on the organization, data, analytical processes, and outputs you are dealing with. But ask yourself these two questions to gauge whether you have a Big Data problem:
– Is your current data architecture/infrastructure able to collect the data and produce the output you require, in a timely manner?
– Do you plan to collect more and more data, terabytes and beyond, with the goal of analyzing it very rapidly, and do you want to archive the raw data for a long period of time?
● If the answers are Yes and No, you are likely not dealing with a Big Data problem. If they are No and Yes, you are likely dealing with a Big Data problem – or possibly just an optimization problem.
Data – The New Oil
Basic Concepts of Petroleum Mining
● Petroleum reserves exist in wells and shale
● Mining equipment extracts petroleum from wells
● Petroleum pipelines transport petroleum to silos and refineries
● Silos store petroleum before it is processed
● Refineries refine petroleum to create petroleum-based products for consumers
● Petroleum engineers design, construct, and maintain petroleum mining, pipeline, silo and refinery infrastructure
● Petroleum scientists research petroleum to create new products and applications using components found in petroleum
DATAMining & Analytics
Data exists in
environment
Sensors and
data collection
software extract
data from
environment
Ingestion / ETL data
pipelines bring raw
data to central
data repository
Data refineries / processing software
processes data to extract analytical
results for use by data consumers
Data repositories / databases
stores data for analytics
purposes
Data engineers design, construct, and maintain
data mining, pipeline, repositories and processing
infrastructure
Data scientists research
on data to create new
products and applications using
analytical results from data
We are your data
engineers!
Handling Big Data: Data Science vs Data Engineering
● Data Science / Data Analysis
– Extract value from data
– Descriptive/Predictive/Prescriptive Analytics
– Unstructured data analysis
– Domain expertise
– Skills: Statistics, R, Python, Spark ML, Weka, Scala, etc.
● Data Engineering
– Infrastructure, technologies and expertise to handle Volume, Velocity, Variety of data
– Data pipelining, ingestion, scheduling and pre-preparation
– Job/query optimization, parallel processing, data processing automation
– Dashboards & data applications
– Hadoop, YARN, NiFi, NoSQL, Python, MapReduce, Java, etc.
Profile of a Data Scientist

Math & Statistics
● Machine learning
● Statistical modeling
● Experimental design
● Bayesian inference
● Supervised learning: decision trees, random forest, logistic regression
● Unsupervised learning: clustering, dimensionality reduction
● Optimization: gradient descent and variants

Domain Knowledge & Soft Skills
● Passionate about the business
● Curious about data
● Influence without authority
● Hacker mindset
● Problem solver
● Strategic, proactive, creative, innovative and collaborative

Programming & Database
● Computer science fundamentals
● Scripting language, e.g. Python
● Statistical computing package, e.g. R
● Databases: SQL and NoSQL
● Relational algebra
● Parallel databases and parallel query processing
● MapReduce concepts
● Hadoop and Hive/Pig
● Custom reducers
● Experience with XaaS like AWS

Communication & Visualization
● Able to engage with senior management
● Storytelling skills
● Visual art & design
● R packages like ggplot or lattice
● Knowledge of visualization tools, e.g. Flare, D3.js, Tableau
Profile of a Data Engineer

Math & Statistics
● Machine learning
● Statistical modeling
● Experimental design
● Bayesian inference
● Supervised learning: decision trees, random forest, logistic regression
● Unsupervised learning: clustering, dimensionality reduction
● Optimization: gradient descent and variants

Domain Knowledge & Soft Skills
● Passionate about the business
● Curious about data
● Influence without authority
● Hacker mindset
● Problem solver
● Strategic, proactive, creative, innovative and collaborative

Programming & Database
● Computer science fundamentals
● Scripting language, e.g. Python
● Statistical computing package, e.g. R
● Databases: SQL and NoSQL
● Relational algebra
● Parallel databases and parallel query processing
● MapReduce concepts
● Hadoop and Hive/Pig
● Custom reducers
● Experience with XaaS like AWS

Communication & Visualization
● Able to engage with senior management
● Storytelling skills
● Visual art & design
● R packages like ggplot or lattice
● Knowledge of visualization tools, e.g. Flare, D3.js, Tableau
History of Big Data
Computing, Before the "Big Data" Era
Input Process Output
Storage
Applications selectively collect only the data necessary for their core functionality, discarding the rest.
Due to technological limitations, applications mostly store recent data and process it to generate relatively simple reports. Old data is regularly purged for performance and cost reasons. Complex, intense processing dealing with massive amounts of data requires big, expensive mainframes or supercomputers.
Reports and analytical outputs are limited to low-frequency processing (e.g. daily, monthly) due to computing limitations.
Google – Pioneer of Big Data
[Diagram: all public websites on the internet → Process & Index → Search service]
* generalization / high level, not actual architecture
Google spiders / Googlebots crawl the internet, capturing every web page they can reach, and bring the data into Google's internal data storage and processing infrastructure.
Google's backend processing engines regularly process and update the website index, rank websites using the proprietary Google PageRank algorithm, and then provide a fast, searchable index of the whole internet to end users.
Google Solution (Pre-2003): GFS + MapReduce
[Diagram: GoogleBots → Google File System → Map/Reduce → Google Search Engine → Search]
Web page data is collected and stored in a distributed datastore, across lots of commodity hardware.
The MapReduce framework analyzes, transforms, and ranks web pages en masse, periodically, before they are sent for indexing in the Google search engine cluster.
Nutch Project (2002) – An Attempt to Create an Open Source Web Search Engine Infrastructure
[Diagram: Nutch Crawler feeding storage and processing components that did not exist yet]
The Nutch project was attempting to build a full-scale web search engine, from crawler to indexing. Back then, however, it only had a web crawler, and had yet to solve the storage and processing problem for the data it gathered.
Google Released the GFS / MapReduce Papers – 2003-2004
● Google released the GFS (late 2003) and MapReduce (late 2004) papers to the community, describing the architecture used to store and manage distributed data at Google, and how that data is processed in a distributed manner.
Nutch Distributed Filesystem + Nutch MapReduce (2004-2005)
[Diagram: Nutch Crawler → NDFS → MapReduce]
Having the goal of creating a search engine, the Nutch project picked up both the GFS and MapReduce papers and developed its own implementations of both technologies, as the Nutch Distributed File System (NDFS) and Nutch MapReduce.
Hadoop Project Branched Out from the Nutch Project
In 2006, the Hadoop project split out from the Nutch project to provide a specialized, affordable solution for storing and processing massive amounts of data using commodity hardware. The open source nature of Hadoop helped spark the move towards Big Data processing across the whole industry by providing an affordable solution for massive data processing.
Now everybody can compute massive amounts of data!!
HDFS MapReduce
The MapReduce Paper Also Inspired Other Technologies Following Its Architecture
Some existing database technologies, such as MongoDB and some PostgreSQL flavors, also adopt MapReduce internally for computing over distributed data in their clusters.
Various programming languages also have libraries that implement MapReduce as a distributed computing algorithm, not necessarily on Hadoop.
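To make the pattern concrete, here is a minimal word-count sketch in plain Python (a conceptual illustration only, not the Hadoop API): a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. In a real framework, the three phases run distributed across many machines.

    from collections import defaultdict

    def map_phase(document):
        # Emit a (word, 1) pair for every word in one input record
        for word in document.split():
            yield (word.lower(), 1)

    def shuffle(pairs):
        # Group emitted values by key (the framework does this in Hadoop)
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Aggregate all values belonging to one key
        return key, sum(values)

    documents = ["big data is big", "data is the new oil"]
    pairs = (pair for doc in documents for pair in map_phase(doc))
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'the': 1, 'new': 1, 'oil': 1}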
Impact of Big Data Adoption on Data Analytics Practice
Stages of Organizational Data Growth
* source: Teradata
Six Sigma – Data-Driven Decision Making
Supported by data from the data analytics practice
Business Intelligence vs Business Analytics

What does it do?
– Business Intelligence: reports on what happened in the past, or what is happening right now.
– Business Analytics: investigates why it happened, and predicts what may happen in the future.

How is it achieved?
– Business Intelligence: basic querying and reporting; OLAP cubes, slice and dice, drill-down; interactive display options such as dashboards, scorecards, charts, graphs and alerts.
– Business Analytics: applying statistical and mathematical techniques; identifying relationships between key data variables; revealing hidden patterns in data.

What does your business gain?
– Business Intelligence: dashboards with "how are we doing" information; standard reports and preset KPIs; alert mechanisms when something goes wrong.
– Business Analytics: a response to "what do we do next?"; proactive and planned solutions for unknown circumstances; the ability to adapt and respond to changes and challenges.
Components of Data Analytics
● Descriptive – What happened?
● Diagnostic – Why did it happen?
● Predictive – What will happen?
● Prescriptive – What to do next?
(Supporting technologies across the spectrum: OLAP, Statistics, Data Mining, Machine Learning, Artificial Intelligence, Deep Learning, Knowledge Base)
Data requirement increases as you move from descriptive towards prescriptive.
Cost of Data Analytics
(From descriptive through diagnostic and predictive to prescriptive: data requirement, data scale, and computing power needs all increase.)
● As we go up the chain, from descriptive to prescriptive, we require more data to analyze in order to compute the outputs
● Historically, only those who could afford supercomputers and large mainframes could apply advanced predictive & prescriptive analytics in their business by analyzing their data assets
– For those who could not afford such advanced technology, computation took so long that it became impractical to apply in business
● With Big Data adoption, several barriers were removed:
– It became easy for programmers to write computation algorithms across hundreds of commodity machines
– Existing algorithms that used to run only on a single computer were ported over for distributed computing
– Cloud-based architectures allow usage-based costing with minimal to no upfront cost
– Big Data on open source technologies removes upfront software cost for the technically savvy
– Advanced analytics became affordable for businesses
Simpler Data Collection = More Data Collection = Bigger Data
[Diagrams: ETL – Raw Data → ETL Job → Transformed Data; ELT – Raw Data → Ingestion Job → Raw Data Replica]
● Traditional Flow (ETL)
– An ETL flow needs to be developed for extracting and transforming raw data before loading it into the central data management platform
– The inherent cost of designing and developing an ETL flow and data model prevents data from being collected early
– Enhancing the data model with new sources involves changes to the ETL job, which can become unmaintainable in the long run
● Data Flow in Big Data Practice (ELT)
– Instead of waiting to develop an ETL flow and destination data model, raw data is brought immediately into the central data management platform through simpler ingestion jobs – the data collection barrier is removed
– Analytics can be done on the raw data, or a transformation job can be executed post-ingestion to prepare the data model
– However, ELT comes at the cost of requiring more data storage; but hardware is usually cheaper than manpower
ELT vs ETL

ELT – Advantages:
● No need for a separate transformation engine; the work is done by the target system itself
● Data transformation and loading happen in parallel, so less time and resources are spent (as only filtered, clean data is loaded into the target system)
● ELT works with high-end data engines such as Hadoop clusters, cloud or data appliances, giving it additional performance and security
● The processing capability of the data warehousing infrastructure reduces the time data spends in transit and makes the system more cost-effective

ELT – Disadvantages:
● The specifics of ELT development vary by platform; e.g. Hadoop clusters work by breaking a problem into smaller chunks, then distributing those chunks across a large number of machines for processing – some problems can be easily split, others will be much harder
● Developers need to be aware of the nature of the system they're using to perform transformations; while some systems can handle nearly any transformation, others do not have enough resources, requiring careful planning and design

ETL – Advantages:
● Single-view interface to integrate heterogeneous data
● Ability to join data both at the source and at the integration server, with the option to apply any business rule from within a single interface
● Common data infrastructure for working on data movement and data quality
● Parallel processing engine providing exceptional performance and scalability

ETL – Disadvantages:
● Migration from server to enterprise edition might require vast time and resources due to the innumerable architectural differences between the Server and Enterprise editions
● No automated error handling or recovery mechanism
● Expensive as a solution for small or midsized companies
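The contrast between the two flows can be sketched in a few lines of Python (a toy illustration; the function names and sample data are made up for this example): ETL transforms before loading and discards the raw data, while ELT lands the raw data first and transforms inside the target platform when needed.

    def etl(extract, transform, load):
        # ETL: transform outside the target, load only the modeled result
        raw = extract()
        load(transform(raw))            # raw data is not kept

    def elt(extract, load_raw, transform_in_target):
        # ELT: land raw data immediately, transform later inside the platform
        raw = extract()
        load_raw(raw)                   # raw replica kept for future analytics
        transform_in_target()           # e.g. a SQL-on-Hadoop job, run on demand

    # Toy demo: "extract" from a list, "load" into other lists
    source = [{"id": 1, "name": " Alice "}, {"id": 2, "name": "Bob"}]
    warehouse, lake = [], []
    clean = lambda rows: [{**r, "name": r["name"].strip().lower()} for r in rows]

    etl(lambda: list(source), clean, warehouse.extend)
    elt(lambda: list(source), lake.extend,
        lambda: warehouse.extend(clean(lake)))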
Evolution of Data Management Architectures
File Based
[Diagram: Input → Store → Read → Output]
● Most basic data management architecture
● Applications read/write data from files on disk
● Reports are generated by reading data from the files
Database
[Diagram: application Input → Inserts → database → Query → Output; the database stores to and reads from disk]
● Most common architecture for applications
● Separate application and database services/nodes
● The database takes care of abstracting the complexity and optimizing the performance of managing file-based storage
● The application deals with inserting the data gathered, and querying data to create outputs
Separated OLTP / OLAP Databases
[Diagram: the application inserts into an operational (OLTP) database; an ETL/sync process copies data into a replica (OLAP) database, which serves report queries]
● A natural path for reducing workload on the database: separate the infrastructure for operational application use from analytical reporting use
● The replica database syncs with the source database, and analytical processing queries are executed on the replica rather than the source
Data Warehouse
[Diagram: multiple application databases feed a central data warehouse through ETL; the warehouse serves report queries]
● When analytical reports are to be generated using data coming from many data sources, a central data warehouse provides the necessary infrastructure for cross-system analytical queries
● Data is moved into the data warehouse through an extract-transform-load process, which normalizes datasets and makes cross-system data joining possible
● Data marts are usually created, containing more human-understandable and domain-specific data structures, making it easy for non-technical users to analyze data in the warehouse
Modern Data Architecture (Data Warehouse + Data Lake)
[Diagram: operational databases feed the Data Warehouse via ETL as before; applications and other sources also write/ingest raw data into a Data Lake, where ELT jobs prepare data for advanced analytics]
● Organizations with more advanced analytical practices want to collect not just data coming from operational databases, but also other datasets, from various sources and formats, that may be generated by applications
● A Data Lake provides a simpler architecture for gathering these datasets for future analytical use, and a highly scalable platform for computing over massive data
● A Data Lake is usually used together with existing Data Warehouses, to leverage their strength in structured data processing
Data Architectures For Big Data
Lambda Architecture
[Diagram: an ingested data stream is written to two layers. Batch layer: all data is stored and periodically batch-precomputed into aggregated views (an ELT path also writes ingested data here). Speed layer: a message queue feeds preprocessing and real-time aggregation into real-time aggregated views. Serving layer: queries merge the batch and real-time views to produce output.]
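A hedged sketch of the serving-layer idea in Python (the view names are illustrative): a query merges a precomputed batch view, which is complete but stale, with a small real-time view covering only the records that arrived since the last batch run.

    def query(key, batch_views, realtime_views):
        # Batch layer: complete but stale, recomputed periodically over all data
        # Speed layer: incremental, covers only records since the last batch run
        # Serving layer: merge both views to answer with complete, fresh results
        return batch_views.get(key, 0) + realtime_views.get(key, 0)

    batch_views = {"page_hits:/home": 10_000}   # from the nightly precompute
    realtime_views = {"page_hits:/home": 42}    # from today's stream so far
    print(query("page_hits:/home", batch_views, realtime_views))  # 10042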
Batch Processing
[Diagram: Ingest → ELT → Write → Query → Output]
Characteristics:
● Scheduled or interactive processing
● Bulk activity
● Historical data, or a subset of historical data
● Processing takes from seconds to hours
● Primarily analytical and reporting processing
● Results are used by automated systems or users
Strengths:
● Able to access and compute all data for analysis
● Relatively simpler to implement
● Familiar setup, as most systems are batch
Weaknesses:
● Not suitable for frequent queries if data is very large; requires data flow optimization to precompute
Real Time Processing
[Diagram: Ingest → Stream → Write → Query → Output]
Characteristics:
● Data is processed as it comes
● Deals primarily with the most recent data
● Processing a record takes milliseconds to tens of seconds
● Supports complex event processing and notifications
● Results are used by automated systems
Strengths:
● Lower load over time, because data is processed as it arrives throughout the day rather than in bulk operations
● Immediate updates to analyzed reports throughout the day, allowing faster decision making
Weaknesses:
● More difficult to develop, as it requires writing a real-time data pipeline application
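As a minimal illustration of the difference from batch (plain Python, no streaming framework), the sketch below consumes records one at a time and updates a running aggregate immediately, which is the essence of real-time processing.

    import random
    import time

    def sensor_stream(n):
        # Stand-in for a message-queue consumer yielding one record at a time
        for _ in range(n):
            yield {"ts": time.time(), "value": random.uniform(0, 100)}

    count, total = 0, 0.0
    for record in sensor_stream(1000):
        # Each record updates the aggregate the moment it arrives;
        # no end-of-day bulk job is needed to refresh the report
        count += 1
        total += record["value"]
        running_avg = total / count
    print(f"processed {count} records, running average = {running_avg:.2f}")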
Data Collection: An Important Component of Data Analytics
Big Data Titans Are Data Collection Titans
When it comes to data collection, these companies collect whatever they can from all points in their business operations.
Data Analytics Is Dependent on Input Data
[Diagram: N inputs → Process (with Storage) → N outputs]
Various Sources of Data Collection
Click Stream
Logs
Sensor
Web /
Social Media
RDBMS
Applications
Devices
Internet
Mobile
Databases
2 Strategies of Data Collection
● Business Question Driven
– Data collected based on business needs
– Clear scope, goals and deliverables
– Manageable size
– Long turnaround time before data can be turned into actionable insights: you have to wait for data growth, and advanced analytics is not possible until the data has grown large enough
● Collect First, Analyze Later
– Data collected as it is discovered or required
– Builds data assets before doing data analytics
– Requires initial investment for data storage
– Risk of collecting useless data
– Business questions are asked against the available data assets; rich data assets allow advanced analytics to become available with a shorter turnaround time
Internet Of Things
Data Is Everywhere
● The Internet of Things is about a connected world, where everything is connected to the internet
– Everything is an input data source
– Everything is an output display
● Sensors everywhere
– GPS
– Temperature
– Humidity
– Luminosity
– Audio
– Video
– etc., etc., etc.
● IoT brings massive amounts of data – Big Data
Typical IoT Application Architecture
IoT sensors collect data and send it to the application backend in the cloud.
An army of servers works together in the cloud to store and process the data for the IoT application.
Analytical results from analysis of the collected data are provided to customers and users, delivering value.
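A minimal sketch of the sensor side of such an architecture, using only the Python standard library; the ingestion URL and payload fields are hypothetical stand-ins for a real backend.

    import json
    import random
    import time
    import urllib.request

    INGEST_URL = "https://iot-backend.example.com/ingest"  # hypothetical endpoint

    def read_sensor():
        # Stand-in for a real GPS / temperature / humidity sensor read
        return {"device_id": "sensor-01", "ts": time.time(),
                "temperature_c": round(random.uniform(20.0, 35.0), 2)}

    for _ in range(3):  # a real device would loop forever
        payload = json.dumps(read_sensor()).encode("utf-8")
        request = urllib.request.Request(
            INGEST_URL, data=payload,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(request)  # the cloud backend stores & processes it
        time.sleep(60)                   # report once a minute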
Tools and Technologies for Big Data
Ecosystem
● Data Collection
● Data Pipelining
● Data Processing
● Data Storage
● Data Serving
● Data Visualization
Data Collection
● The starting point of accumulating data assets
● Measure or capture environment variables or state as digitized data
● Tools/equipment include, but are not limited to:
– Any programming language
● Write out application states as logs (see the sketch after this list)
– Web scrapers
● Scrapy / Portia
● FMiner
● Outwit
● Mozenda
● Capterra
– Sensor equipment
● Raspberry Pi
● Arduino
● Various sensor circuits
● SCADA
– RDBMS extractors
● Sqoop
● Various ETL tools
● Custom scripts
– Mobile devices
● Modern smartphones have a rich array of sensors
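For "write out application states as logs" above, a minimal Python sketch: emit one self-describing JSON object per line so downstream ingestion jobs can parse events without custom code. The event and field names are illustrative.

    import json
    import logging
    import time

    logging.basicConfig(filename="app_events.log",
                        level=logging.INFO, format="%(message)s")

    def log_event(event_type, **fields):
        # One JSON object per line: trivially parseable by ingestion pipelines
        record = {"ts": time.time(), "event": event_type, **fields}
        logging.info(json.dumps(record))

    log_event("page_view", user_id=123, path="/checkout")
    log_event("purchase", user_id=123, amount_myr=49.90)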
Data Pipelining
● Move data from sources to repositories
● Coordinate and schedule data extraction and pre-processing workflows while data is in flight to the repositories (see the sketch after this list)
● Tools include, but are not limited to:
– Workflow programming libraries in various languages
● Airflow
● Luigi
● Oozie
● etc.
– Traditional ETL tools
● Talend
● Pentaho
● Oracle Data Integration
● etc.
– Stream data pipeline tools
● Apache NiFi
● NodeRED
● StreamSets
● Storm
● Kafka Connect
● etc.
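To illustrate what these coordinators do at their core, here is a toy dependency-ordered task runner in plain Python (a conceptual sketch only; real tools such as Airflow or Oozie add scheduling, retries, monitoring and distribution).

    tasks = {
        "extract":   (lambda: print("pull from source"),   []),
        "land_raw":  (lambda: print("write raw replica"),  ["extract"]),
        "transform": (lambda: print("prepare data model"), ["land_raw"]),
        "publish":   (lambda: print("refresh reports"),    ["transform"]),
    }

    def run(name, done=None):
        # Depth-first: run every upstream dependency first, each task once
        done = set() if done is None else done
        if name in done:
            return
        func, deps = tasks[name]
        for dep in deps:
            run(dep, done)
        func()
        done.add(name)

    run("publish")  # executes extract -> land_raw -> transform -> publish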
Data Storage
● Store and archive data for short- and long-term use
● Works together with the processing infrastructure to extract insights from data by providing optimized data structures
● Tools include, but are not limited to:
– Software-defined distributed storage
● HDFS
● GlusterFS
● Ceph
● ZFS
● etc.
– Databases
● PostgreSQL
● Oracle
● MSSQL
● etc.
– NoSQL datastores
● MongoDB
● Elasticsearch
● Solr
● Neo4j
● HBase
● Redis
● etc.
– Message queues
● Kafka
● RabbitMQ
● Redis
● etc.
Data Processing
● Process and compute data to extract value and insights
● Process data either in batch or in real time, ideally in a distributed manner
● Provides algorithms for complex computations
● Tools include, but are not limited to:
– Any programming language, especially R, Scala, Python
– Distributed batch processing engines
● MapReduce
● Tez
● Hive
● Pig
● Spark
– Distributed stream processing engines
● Storm
● Celery
● StreamParse
– Traditional ETL tools
● Talend
● Pentaho
● Oracle Data Integration
Data Serving
● Serve processed data for high-performance analytical queries
– Utilizes highly optimized data structures for purpose-specific queries
● Tools include, but are not limited to:
– High-performance OLAP
● Druid
● Kylin
– Graph data stores
● Neo4j
● ArangoDB
– Search engines
● Elasticsearch
● Solr
– Time series databases
● Graphite
● InfluxDB
● OpenTSDB
● Prometheus
Data Visualization
● Display data summaries and reports in the form of visual diagrams and charts
● Visual data discovery and exploration
● Tools include:
– Traditional BI / reporting tools
● Pentaho
● Jasper
● SAS
● SpagoBI
● Microstrategy
● etc.
– Real-time dashboarding
● Grafana
● Kibana
– Visualization libraries
● D3.js
● DC.js
● Shiny
● Bokeh
● etc.
– Visualization platforms
● Tableau
● Redash
● Superset
Understanding Open Source License &
Consumption Model
Modern Big Data Technologies Are Driven by the Open Source Community
Open Source Software Definition
● Software licensed under a license that guarantees the following rights:
– Free Redistribution
● The license shall not restrict any party from selling or giving away the software as a component of an aggregate software
distribution containing programs from several different sources. The license shall not require a royalty or other fee for
such sale.
– Source Code
● The program must include source code, and must allow distribution in source code as well as compiled form. Where
some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source
code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The
source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated
source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
– Derived Works
● The license must allow modifications and derived works, and must allow them to be distributed under the same terms as
the license of the original software.
– Integrity of The Author's Source Code
● The license may restrict source-code from being distributed in modified form only if the license allows the distribution of
"patch files" with the source code for the purpose of modifying the program at build time. The license must explicitly
permit distribution of software built from modified source code. The license may require derived works to carry a different
name or version number from the original software.
Open Source Software Definition
– No Discrimination Against Persons or Groups
● The license must not discriminate against any person or group of persons.
– No Discrimination Against Fields of Endeavor
● The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may
not restrict the program from being used in a business, or from being used for genetic research.
– Distribution of License
● The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of
an additional license by those parties.
– License Must Not Be Specific to a Product
● The rights attached to the program must not depend on the program's being part of a particular software distribution. If the
program is extracted from that distribution and used or distributed within the terms of the program's license, all parties to
whom the program is redistributed should have the same rights as those that are granted in conjunction with the original
software distribution.
– License Must Not Restrict Other Software
● The license must not place restrictions on other software that is distributed along with the licensed software. For example,
the license must not insist that all other programs distributed on the same medium must be open-source software.
– License Must Be Technology-Neutral
● No provision of the license may be predicated on any individual technology or style of interface.
Open Source Does Not Mean No Copyright
● Open Source software is copyrighted, not public domain
– The author retains the copyright and intellectual property; however, the author chooses to grant licensees of the software additional rights which normally are not granted under a proprietary license
– Any user of the software automatically becomes a licensee the moment they acquire a copy of the software
– Open Source authors usually re-use legal license documents that already exist in the Open Source community as the license for their software
● Should you not comply with the terms and conditions in the license document, the author has the right to enforce the license
Types of Open Source Licenses

Permissive:
● Most flexible
● Derivative works are not required to be Open Source or to use the same license
● e.g. MIT, BSD

Weak Copyleft:
● Some parts of derivative works are required to use the same license
● Usually this license is used on a library
● Modifications to the library itself are required to be released under the same license, but projects importing the library are not required to use the same license
● e.g. LGPL

Strong Copyleft:
● Strict enforcement of the same license for any derivative works
● All projects importing libraries provided by software licensed under this license are required to also be released under the same license as the original work
● e.g. GPL, AGPL
Consumption Model
● Common misconceptions about Open Source in the enterprise:
– Open source is free
– Open source is not licensed
– Open source comes without support
– Open source software is not stable and changes too fast
● Open Source is a software development model, but not exactly a software consumption model. The consumption model is more or less similar to that of other software
● Software can be free, but human time is not
1. Software is developed in public, possibly with community contributions, and with many frequent improvements
2. An Open Source software distribution company takes a snapshot of the codebase; stabilizes and integrates it; creates support, training and warranty models; and productizes the software
3. An enterprise customer buys the productized software, and receives support, training and warranty from the distribution company, and services from SIs
4. An ecosystem of System Integrators, ISVs and trainers provides professional services and added value around the productized software
Upstream Software vs Downstream Enterprise Product

Upstream:
● Rapidly changing
● Latest and greatest features
● Can be unstable
● No warranty, no support, or minimally supported
● Most of the time free of capital cost

Enterprise:
● Fewer changes over a short period of time
● Tried and tested features
● Generally more stable
● Comes with warranty, support SLA, training and certifications
● Charged for support & warranty subscriptions and professional services
Do I Have to Use an Enterprise Product?
Do I have to use the Enterprise edition of an Open Source software?
– Are you going to use it as a hobby or professionally? Hobby → not necessary.
– If professionally: are you using it for R&D or for production? R&D → not necessary.
– If production: do you have regulations against using software without warranty or internal expertise in production? Yes → required.
– If no such regulations: do you have the budget? No → not necessary; Yes → recommended.
A Framework For Organizational
Data Journey
Big Data Transformation Journey
Stage 1: Data Discovery on Active Archive
● Initial starting infrastructure for Proof of Value
– 2 masters, 3 workers for batch/interactive processing
– 1 node for stream processing
● Select several datasets and ingest both current data and the historical archive, which are then made available for analyzing patterns over a long historical context. Also ingest related tables
– e.g. Touch N' Go transactions
● Familiarize with the technologies and processes involved in Big Data
● Create reports/dashboards detailing discoveries from analysis of the historical archive
● Time frame: 6-12 months
Stage 2: Data Lake
● Medium-scale cluster for a central data lake
– 3 masters, 10-15 workers for batch/interactive processing
– 2 nodes for stream processing
● On-board more datasets from various internal sources into the data lake to get a 360° view of the organization
– e.g. CRM, ERP, website logs, device logs, etc.
● Develop and launch reports and dashboards on cross-dataset relationships and patterns
– e.g. 360° view of the customer
● Time frame: 6-24 months
Stage 3: Advanced Analytics
● Large-scale cluster for complex computation
– 4 masters, 20++ workers for batch/interactive processing
– 4++ nodes for stream processing
● On-board external data sources for enrichment against internal datasets
– e.g. social media, web scrapers, IoT sensors
● Aggressive data collection and data mining as a strategic direction and asset
● Identify repetitive patterns, create models to predict them, and leverage them in AI-powered applications
● Time frame: 12-24 months
Stage 4: Continuous Improvement
● Continuous data-driven transformation and innovation
Hortonworks Case Studies
The Data Journey
to a Golden Batch
Case Study: Merck's Journey
Improving Life Sciences Manufacturing Yields Presents a Complex Data Discovery Challenge
● Vaccine manufacturing requires precise control of complex fermentation processes
● Two batches of a vaccine, produced using an identical manufacturing process, can exhibit significant yield variances
● Batches that fail quality standards can cost $1 million each
● Data for one vaccine was stored across 16 different systems, and high storage costs limited the length of data retention
Merck's Journey: The Journey to the Golden Batch
[Diagram: renovate → innovate, through Sensor Data Storage, Scientific Search, Epidemiology and Vaccine Yield Optimization, towards The Golden Batch]
● Combined 10 years of data on one vaccine: 1 billion records
● 5.5 million batch comparisons
● 1st-year yield boost of 40K more doses → $10M profit impact
● McKinsey: 50% yield increase
The Data Journey
to Safe Roads
Case Study: Progressive's Journey
Progressive Wanted to Ingest IoT Data to Predict Risk for its Usage-Based Insurance Product
● Progressive Snapshot offers usage-based insurance through an in-car sensor that transmits IoT driving data
● Sensors collect up to six months of data from drivers, and the data is archived for years, per regulatory requirements
● Progressive's existing systems were not scaling efficiently
● It took 5-7 days to transform only 25% of available UBI data
Progressive's Journey: Rewarding Safer Drivers and Improving Traffic Safety
[Diagram: renovate → innovate, through Sensor Data Ingest, Web Log Analysis, Online Ad Placement, Individual Driving Histories, Claims Notes Mining and Usage-Based Insurance (UBI), towards Safe Roads]
● Snapshot plug-in devices capture driving detail
● Progressive stores more than 10 billion miles driven
● Through a web app, customers can review their own driving detail and improve their safety
● Snapshot and usage-based insurance drove $2.6 billion in 2014 Progressive premiums
The Data Journey
to Better Health
Case Study: Mercy's Journey
Mercy Medical System Sought a Data Lake for a Single View of its Patients – "One Patient, One Record"
● Existing platform impeded the goal of enriching Epic data for 1 million patients across 35 hospitals and 500 clinics
● Moving Epic EMR data to the Clarity EDW took 24 hours and was "never going to enable real-time analytics"; it now takes 3-5 minutes with HDP
● Improved billing processes resulted in $1M additional annual revenue from newly documented secondary diagnoses and care
Case Study: Mercy's Journey – Better Health Through Data
[Diagram: renovate → innovate, through Epic EMR Replication, Device Data Ingest, Lab Notes Archive, Privacy Database, Vital Sign Monitoring, Billing, OPEX Efficiency, Epic Enrichment, Single Patient Record, Medical Decision Support and Preventive Care, towards Better Health]
● Searches of free-text lab notes speed researcher insight from "never" to "seconds"
● Ingest of ICU vital signs increased by 900X, letting clinicians respond more quickly
● Mercy is building real-time tools to support surgical decisions and preventive care
Webtrends
The Data Journey Towards
Personalized Online Ads
Webtrends' Journey
Massive Volumes of Weblogs Fueled Webtrends' Growth – and also its Skyrocketing Storage Costs
● Webtrends provides digital marketing solutions for more than 2,000 companies in 60 countries – processing 13 billion online events daily
● Data used to be processed in relational databases and stored on large NAS appliances, which were not economical at scale
● Processing occurred on-premises, without cloud-based capabilities
● Diseconomies of scale hampered the company's objective of helping its customers predict optimal online ad placement
Webtrends' Journey: Petabytes of Weblogs Analyzed with Spark at Scale
[Diagram: renovate → innovate, through SQL Server Offload, Web Log Analysis, Per-Customer Click Path, Behavioral Segmentation, LCV Analysis and Ad Click Predictions, towards Personalized Online Ads]
"We're able to…look at this data set and process it and do predictions, behavioral analysis. We can do things that allow us to determine ROI for different actions and behavioral patterns." – Peter Crossley, Chief Architect
● Data streams from a vast array of desktop and mobile devices
● 13 billion daily events, each collected within milliseconds
● No data cleansing necessary prior to analysis with Apache Spark
● 2 clusters consolidated into 1 YARN-based HDP cluster
● Launched new product Webtrends Explore™ – powered by HDP
Watch The Webtrends Videos
https://youtu.be/hwpGj57VGz0
https://www.youtube.com/watch?v=LifVwIwN61E
The Data Journey
for Cyber Security
Case Study: Symantec's Journey
Analyzing Streaming Threat Data to Increase Velocity for Time to Protection
● The Symantec™ Global Intelligence Network includes more than 57 million attack sensors in 157 countries
● Data streams from 75 million users on 120 million devices
● Legacy platforms created 3-4 hour processing latencies to analyze log files for digital threats
● Attackers could exploit those processing time windows
Symantec's Journey: Data Science Speeds Time to Protection
[Diagram: renovate → innovate, through Greenplum Offload, Security Log Analysis, Device Data Ingest, Threat Archive, Metadata Capture, Threat Detection, Attacker Detection, Threat Predictions and Unified Security, towards Digital Security]
● Threat detection latency reduced from 4 hours to 2 seconds
● Time to protection improved 5000x
● Machine learning over tens of petabytes of historical data predicts threats to customers
● The cloud team uses Ambari and Cloudbreak for dynamic clusters to meet peak workloads
The Data Journey to
Secure Telco Networks
Neustar's Journey
Neustar's Telco Network Analytics Business was Limited by High Data Storage Costs
● Neustar offers its telecommunications customers Network Analytics services, but faced a 2011 cost of $100,000 per terabyte of storage
● It could only economically capture 10% of the data flowing through its networks, retained for 60 days
● Neustar's CEO challenged her data warehousing group to retain 100% of the network data for at least one year
Neustar's Journey: Architecture Renovation Funded Service Innovation
[Diagram: renovate → innovate, through Network Data Storage, Single View of the Network, Enriched App Data, DDoS Attack Mitigation, Rapid Threat Response, Proactive Network Protection and New Info Services, towards Secure Telecom Networks]
● Cost per terabyte reduced from $100K to under $250
● 100% of data now retained, growing storage capacity 150X
● Data retention extended from 60 days to 2 years
● Elimination of existing support fees saved millions annually
● New data assets help Neustar grow its product portfolio
The Data Journey to a
Balanced Supply Chain
Cardinal Health's Journey
Data Ingest Constrained Analysis of the Medical Supply Chain at Fuse by Cardinal Health
● Cardinal Health supplies equipment and medicines to 85% of US hospitals and clinics
● Limited visibility into the entire supply chain prevented suppliers from understanding how their drugs were prescribed
● Acute care pharmacists couldn't see all the product options that they could prescribe for various conditions
Cardinal Health's Journey: Launching a New Line of Business
[Diagram: renovate → innovate, through Sensor Data Ingest, Public Data Ingest, Prescription Archive, Single Patient Record, Drug Supply Chain Analytics, Drug Cost Optimization, Clinical Decision Support, Outcome-based Medicine and Pandemic Response, towards a Balanced Medical Supply Chain]
● Fuse by Cardinal Health aims to make healthcare safer and more cost-effective
● The team enriches supply chain data with public sources – bringing suppliers, providers and patients closer together
● Data processing speeds doubled
● Fuse shows suppliers how their drugs are used
Anonymous Case Studies
Online Ad Placement Analytics for a Mega-Retailer
Creating Opportunity – Data: Clickstream & Server Log
Advertising: manages online media programs for retail e-commerce websites
Problem: digital ad firm unable to connect impression & click data
● One of the world's largest retail websites made guesses about online ad placement based on Google Analytics
● Clickstream data flowed in at 100s of MB per hour and billions of rows per month; this data strained the existing architecture
● Inability to connect ad impressions to clicks to purchases
● No ability to detect browsing device, geo-location, or whether the customer was in the store
Solution: unified web tracking data repository provides a 360-degree view of online behavior
● Impression files and click files are stored in the same data lake, and easily joined for customer insight
● With better targeting, fewer ads can be placed, improving the overall customer web experience
● Social media data will be added for brand sentiment analysis
AD1
Monetize Anonymous & Aggregate Banking Data
Creating Opportunity – Data: Structured, Clickstream, Social & Unstructured
Banking: one of the largest US banks
Problem: valuable banking data needed to be anonymous & unified
● Bank possesses data that indicates larger macro-economic trends, which can be monetized in secondary markets
● Regulations and company policies protect customer privacy
● Data sets are isolated in legacy silos controlled by LOBs
● IT challenged by joining data while guaranteeing anonymity
Solution: cross-bank data lake for aggregate data with secure access
● Multiple data sets abstracted from source platforms
● Single point of security & privacy for de-identification, masking, encryption, authentication and access control
● Mortgage bankers, consumer bankers, the credit card group and treasury bankers have access to the same cross-sell data
● Interoperability with partners SAS, R, RedHat & Splunk
● Economies of scale for compressing & archiving data
● Significant reduction in storage costs from prior platforms
BK1
Sensor Data Monitors Buildings for Efficiency
Improving Efficiency – Data: Sensor
Building Management: building efficiency and power solutions; >$420B in revenue; >140 employees
Problem: managing service calls on HVAC in commercial buildings
● More than 70K systems in buildings around the US
● Systems transmit data, but it is mostly kept on site or discarded
● Servicing costs are high, due to limited data on each unit
● Data on work orders, sales orders and service orders is stored in different databases and not correlated
Solution: data consolidation and predictive analytics for efficiency
● Raw data from HVAC sensors will land in HDP, along with work order, sales order and service call data
● The system will predict component failures for:
– Product upsell → increased revenue
– Service call efficiency → reduced costs
● Management insight for a new service offering
BM1
Sensor Data From Smart Electricity Meters
Improving Efficiency – Data: Sensor
Energy: one of the world's largest producers of electricity; >$100B in revenue; >39 million customers; >150K employees
Problem: utility needs to match electricity supply with demand
● Utilities cannot store power; it needs to be used
● Some energy load is predictable, some is unpredictable
● Overproduction requires cutting back, running below capacity
● Underproduction risks starting less efficient "peaker plants"
● Smart meter data allows real-time analysis that can help effectively match energy production with consumption
Solution: predict demand spikes by analyzing real-time sensor data
● Hive + Storm on YARN stream data into Hadoop
● R + Mahout analyze aggregate consumption trends for predictive algorithms
● More effective matching of energy production and consumption reduces energy costs and emissions
EN1
Proactive Oil Field Decisions for Pump Equipment Utilization
Improving Efficiency – Data: Structured, Sensor & Server Log
Energy: major provider of upstream oil field services; >$29B in revenue; operations in 80 countries; >75K employees
Problem: limited visibility into utilization of pump equipment
● Oil field services: exploration, drilling, well construction & production optimization
● Company manages a huge base of costly equipment in the field, in 80 countries
● Time-consuming, manual effort required to collect & analyze pump equipment data
● Standard data warehouse model & traditional reports did not scale well & yielded incomplete results
Solution: combine structured data, sensor & log data for proactive equipment decisions
● Reduces the manual time and effort to collect & analyze data from sensors above and below ground, as well as log data from pump trucks
● Big Data project runs in the Accenture Cloud, with Accenture providing data architecture, data science and project management services
● Project integrated with embedded technologies from Hortonworks technology partners: Microsoft, SAP & HP
● Project goal: reduce equipment expense and improve margins
EN2
Powering Music Recommendations
Creating Opportunity – Data: Clickstream & Server Log
Entertainment: online music streaming; >$500M in revenue; >24M users
Problem: CDH cluster failed, causing downtime
● Highly technical team was running a CDH cluster, without support
● CDH failed; the CTO asked the team to research support options
● A Hive table stores data on all music streamed by users
● Data on Hive is mission-critical: used to recommend music & to pull the monthly reports used to pay each music label
● Data expertise is their only sustainable competitive advantage
Solution: HDP powers the music recommendation engine
● Stable recommendation engine and reconciliation reports
● Pro-active technology partnership with their engineers, who are consumers of & contributors to Hadoop
● 2X per year, Hortonworks reviews the cluster for optimization
● Data was migrated from CDH to HDP, quickly and easily
ET1
Donor and Voter Analytics for a Political Organization
Creating Opportunity – Data: Unstructured
Fundraising: political organization dedicated to tele-fundraising, voter contact and media services; >$1M in revenue; ~100 employees
Problem: limited insight into donor behavior & voter mobilization
● Fundraising phone services lacked analysis on why donors give
● For campaign management, needed analysis on what factors cause constituents to register and vote
● Client knew they needed Hadoop for storage and analysis
● Needed education on roadmap, use cases and execution
Solution: donor data store improves revenue from tele-fundraising
● Speed: rapid delivery of the donor data store
● Deployment flexibility: runs in a Windows environment
● Targeted: phone reps talk to donors about the issues important to them
● Discovery: explore and enrich data from campaign operations
FR1
Analysis of Gamer Data for Future Innovation
Creating Opportunity – Data: ETL
Gaming: online strategy & role-playing games; ~4M users; ~$325M in revenue; ~500 employees
Problem: social gaming platform needs more storage, more stability
● 4 million monthly gamers generate customer interaction data
● Existing CDH cluster was going down every month
● Desired tight integration with Datameer analytics tools
● Needed interactive query; Impala was not meeting that need
● Rapidly growing user base; need to manage the cluster as it scales
Solution: HDP for stability at scale, tight integration with Datameer
● Stable cluster that doesn't fall down like CDH did
● Easy data extracts from SQL Server
● Datameer analytics tools certified on HDP
● High-performing Hive queries
● Ambari for provisioning and maintenance as the cluster scales up
GM1
Gamer Migrates a Homegrown Cluster to HDP
Creating Opportunity – Data: Clickstream, Server Log, Social & ETL
Gaming: social gaming; ~5M users; >$100M in revenue; ~500 employees
Problem: social gaming platform used Hadoop, but needed support
● Social gaming platform built its own Hadoop cluster
● Heavy users of Hive for analysis of player behavior
● Hadoop analysis informed strategies to prolong length of play, encourage purchase of virtual goods, and respond to timed in-game events
● Heavy processing needs and ~1 petabyte of data outpaced the company's ability to support and extend its in-house cluster
Solution: HDP functionality + Hortonworks support = better games
● Easy migration from the native Hadoop cluster preserved data and processing tools
● HDP cluster includes a more complete ecosystem: Ambari, Flume, HBase, Hive, Oozie, Pig, Sqoop, Storm, ZooKeeper
● Social media sentiment analysis, combined with data on player stats and behavior, is used to improve games and their revenue
GM2
Clearing the Federal ETL Consulting Backlog
Improving Efficiency – Data: ETL
Government: professional service provider consulting on federal projects; >$13B in revenue; >50K employees
Problem: federal consulting practice faces ETL backlog
● Sequestration budget cuts created demand for ETL offload from SAS
● Consulting practice faces a backlog of millions of dollars consulting on offload from SAS at 20 federal civilian agencies
● After offload, all data must still be easily accessible
Solution: rationalized data storage saves taxpayer money
● Federal civilian agencies reduce ongoing data storage costs
● No loss of data or disruption to operations
● Base SAS and SAS/ACCESS are two out-of-the-box solutions for connectivity between SAS and Hadoop, via Hive
GV1
Page 111 / 156
Processing Time-Sensitive Employment Reporting w/ Confidence
Problem
Agency reporting on labor data has 9 working days to prepare report
• Agency reports on inflation, pay and benefits, unemployment levels, labor productivity
• Agency’s monthly employment report moves financial markets
• State agencies report unemployment data to federal office by first Friday of the month
• Total data set is hundreds of millions of rows in 30 comma-separated files
• If team finds errors in state data, it may take days to correct with the state affiliate
• Final report must be published by the third Friday of the month, time is precious
Solution
HDP speeds processing and improves confidence in unemployment findings
• Hortonworks partner OpenOsmium introduced Hortonworks to client team
• Federal budget pressures created favorable policies towards open source software
• POC pilot: processing one of thirty files on HDP/Amazon Cloud solution
• Processing time reduced from 18 hours to less than 1 hour
• Absolutely no disruption to existing systems or operations
• Cloud cluster runs on “as needed” basis, shut down remotely when not needed
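A rough sketch of how one of the thirty comma-separated files might be processed on the cluster with Hadoop Streaming, where mapper and reducer are plain Python scripts reading stdin. The column layout (state code in column 0, an unemployment count in column 3) is an assumed illustration, not the agency's actual schema.

#!/usr/bin/env python3
# Hadoop Streaming sketch: sum a count column per state from a CSV file.
# Run the same script as mapper and reducer, e.g.:
#   hadoop jar hadoop-streaming.jar \
#     -input /data/state_file_01.csv -output /out/unemp \
#     -mapper "python3 etl.py map" -reducer "python3 etl.py reduce" \
#     -file etl.py
# Column positions below are hypothetical assumptions.
import sys

def do_map():
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) > 3 and fields[3].isdigit():
            # key = state code, value = unemployment count
            print(f"{fields[0]}\t{fields[3]}")

def do_reduce():
    # Streaming guarantees keys arrive sorted, so we sum per run of keys.
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    do_map() if sys.argv[1:] == ["map"] else do_reduce()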
Improving Efficiency
Data: Structured
Government
US federal government labor
agency
GV2
Page 112 / 156
Sentiment Analysis for Government Programs
Problem
Min. of Ed. felt removed from public sentiment on programs
• In-person events lacked reach and persistence
• Ministry of Education wanted to understand sentiment from citizenry on specific issues
such as childhood obesity
• Two dedicated analysts pored over the social media stream and provided daily reports to a member of parliament
• IT team sought improvement over limitations of manual analysis
Solution
Powerful “same day” sentiment analysis helps outreach
• Team produces daily memos on public sentiment, now with:
– Reach: includes opinions from broader base of citizenry
– Confidence: more data, more confidence in opinion analysis
– Frequency: daily reads show policy-makers changes over time
– Precision: allows micro-analysis of specific issues and geos
• Solution aligns to government’s support for open source
• Individual social media authors receive invitations to in-person meetings with government
ministers
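At its simplest, the daily sentiment read described above can be approximated with lexicon scoring over the social stream. The sketch below is a stdlib-only illustration with toy word lists, not the ministry's actual model.

# Minimal lexicon-based sentiment sketch (stdlib only).
# The tiny word lists are toy assumptions, not a production model.
import re
from collections import Counter

POSITIVE = {"good", "great", "support", "helpful", "improved"}
NEGATIVE = {"bad", "worried", "unfair", "failing", "angry"}

def score(post: str) -> int:
    words = re.findall(r"[a-z']+", post.lower())
    counts = Counter(words)
    return sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)

posts = [
    "Great to see the school meal programme improved this year",
    "Worried that childhood obesity is still rising",
]
daily_total = sum(score(p) for p in posts)
print("net sentiment for the day:", daily_total)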
Creating Opportunity
Data: Social
Government
European national
government
GV3
Page 113 / 156
Sensor Data for Healthcare Supply Chain
Problem
Medical products have limited shelf life, tracking essential
• Medical products delivered to pharmacies and hospitals
• Epidemics require agile changes to delivery schedules
• Materials are time sensitive and climate-controlled
• Delivery logistics are complex & subject to risks outside of the company’s control
(product availability, weather, traffic, etc.)
• Slow delivery can harm supplies and medical outcomes
Solution
Sensor data protects supply chain, improves efficiency
• Sensor data from individual items and vehicles will give the company
unprecedented supply chain visibility
• Analytic platform enables predictive algorithms for infrastructure planning, disease forecasting and supply chain forecasting
• Better tracking reduces waste, improves customer confidence and patient health
Improving Efficiency
Data: Sensor
Healthcare
Supplier of
pharmaceuticals & medical
products to pharmacies &
hospitals
>$100B in revenue
>30K employees
HC1
Page 114 / 156
Predictive Analytics & Real-time Monitoring of Vital Signs
Problem
Unable to store sufficient data for decision support
• 22 years of data for 1.2 million patients, totaling ~9 million records
• Data on the legacy system was neither searchable nor retrievable
• Cohort selection for research projects was slow
• For decision support, clinicians had minimal access to historical data gathered
across all patients
Solution
Unified repo provides data to both researchers & clinicians
• “View only” legacy system retired, saving $500K
• 9 million historical records now searchable & retrievable
• Records stored with patient identification for clinical use, same data presented
anonymously to researchers for cohort selection
• Real-time monitoring: patches record vital signs every minute, algorithms notify
clinicians if numbers cross risk thresholds
• Readmit reduction: heart patients weigh themselves daily, algorithms notify doctors about unsafe weight changes
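The notification logic in the last two bullets reduces to threshold rules over streaming readings. A minimal sketch follows; the thresholds are hypothetical placeholders, not clinical guidance.

# Minimal threshold-alert sketch for streaming vital signs (stdlib only).
# Thresholds below are hypothetical placeholders, not clinical guidance.
THRESHOLDS = {"heart_rate": (40, 130), "spo2": (92, 100)}  # (low, high)
MAX_DAILY_WEIGHT_GAIN_KG = 1.5  # assumed heart-failure weight rule

def vital_alerts(reading):
    """reading: dict like {'patient': 'p1', 'heart_rate': 142, 'spo2': 95}"""
    for vital, (low, high) in THRESHOLDS.items():
        value = reading.get(vital)
        if value is not None and not (low <= value <= high):
            yield f"{reading['patient']}: {vital}={value} outside [{low}, {high}]"

def weight_alert(patient, yesterday_kg, today_kg):
    if today_kg - yesterday_kg > MAX_DAILY_WEIGHT_GAIN_KG:
        return f"{patient}: weight up {today_kg - yesterday_kg:.1f} kg in one day"
    return None

print(list(vital_alerts({"patient": "p1", "heart_rate": 142, "spo2": 95})))
print(weight_alert("p2", 80.0, 82.1))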
Improving Efficiency
Data: Sensor, Social
& ETL
Healthcare
Public university teaching
hospital
Consistently rated by US
News & World Report as
among America’s best
hospitals
>17K patient admissions
>400 physicians
~12K surgeries (‘12)
HC2
Page 115 / 156
Affordable, Scalable Data for Healthcare Analytics
Problem
Relational database architecture limited data exploration
• Develops and maintains analytic applications for doctors
• Company couldn’t access the volume or variety of data they wanted for those
applications
• Analyzing huge data sets on relational databases was too slow
Solution
HDP reveals new big data insights, with costs savings & flexibility at scale
• Link and access new types of data that are currently outside of the healthcare
domain such as: pharmacy receipts, text messages or patient web searches
• Per-node TCO of data on HDP was 25% that of current relational DB
• Open-source Hadoop ecosystem gives multiple hardware and software
integration options as company scales its architecture
Creating Opportunity
Data: ETL
Healthcare
Analytics tools and
decision support for the
healthcare industry
~$130M in revenue
>2K employees
HC3
Page 116 / 156
Rapid Detection & Intervention for Stroke Prevention
Problem
Conditions appearing to be strokes delay short windows for critical intervention
• Some conditions show stroke-like symptoms (e.g. migraines or muscle spasms)
• Stroke neurologists spend 50% of their time with non-stroke patients
• Transient ischemic attacks “TIAs” are mini-strokes that present like migraines, but are
highly predictive of future full-blown strokes within the following days
• Incomplete or slow access to patient data hampers clinicians’ ability to respond promptly
to TIAs
Solution
HDP unifies present day images with historical data to quickly identify TIAs
• Patient contact records (calls to the province’s health hotline) merged with population
historical records and present-day medical images improve diagnosis
• Algorithms on population risk factors (weight, age, cardiovascular problems) are mined
for probability that a given patient has similar risk factors
• With quantified risk factors, doctors quickly identify those at risk of imminent stroke
• Prescriptions of blood thinners, exercise and diet reduce incidence of those strokes
Improving Efficiency
Data: Sensor, Unstructured &
Structured
Healthcare
Top Canadian research
university, researching
epilepsy, stroke care and
brain surgery outcomes in
government-run healthcare
system
HC4
Page 117 / 156
Management of Chronic Health Conditions Such as Epilepsy
Problem
Epilepsy is a chronic, unpredictable & difficult to treat condition
• Epilepsy can go undiagnosed while seizures are minor
• Epileptics are at higher risk of depression, making condition more difficult to manage
• Tabular data is gathered through treatment at epilepsy specialty clinics
• Additional tabular data in the system is difficult to combine for a complete picture
• Social data on patient behavior is unavailable for combination with tabular data on
clinical history and pharmaceutical prescriptions
Solution
HDP healthcare data lake joins disparate data, for better disease management
• Data lake for a 360-degree view of the patient: electronic medical records, history of
clinic visits, Facebook, Twitter & sentiment survey data
• Regular, patient self-reporting with targeted surveys via mobile and web applications
• Dynamic calculation of changing sentiment scores useful for proactive outreach
• Clinicians will have ability to reference current psychographic & sentiment data
immediately before (and during) the patient’s scheduled clinical visits
Improving Efficiency
Data: Social & Structured
Healthcare
Top Canadian research
university, researching
epilepsy, stroke care and
brain surgery outcomes in
government-run healthcare
system
HC5
Page 118 / 156
Robotics & Real-Time Decision Support in Brain Surgery
Problem
Brain surgeons make real-time decisions using only a fraction of available
data
• Brain surgeons may spend hours working (non-destructively) through brain tissue
• For aneurysms, they must clamp a weak point in a specific blood vessel
• Surgical assistant presents ~100 clamps, which the surgeon tries until finding a good fit
• Clamps exposed to surgical environment are discarded at a cost of
$100K/surgery
• Time selecting/testing clamps can negatively affect surgical outcomes
Solution
Robots, streaming video & surgery inside an MRI with real-time decision
support
• Researchers developed non-magnetic robots that surgeons control within an MRI
• Constant streaming of MRI imaging helps decisions while surgery is underway
• Recordings of MRI data stored in Hadoop, analyzed w/ machine-learning
algorithms
• MRI images compared to surgical outcomes for insight into best practices
Improving Efficiency
Data: Sensor, Unstructured &
Structured
Healthcare
Top Canadian research
university, researching
epilepsy, stroke care and
brain surgery outcomes in
government-run healthcare
system
HC6
Page 119 / 156
Data Science on Text-based Health Claim Records
Problem
Claims data in PDFs, hard to identify coding errors
• Produces applications for medical decision support
• Goal is marrying electronic health records with claims data
• 300K daily interactions with individuals generate unstructured data in PDFs (claims records and patient-reported outcomes)
• Data analysis is disjointed, difficult to identify patients and events that have been
mis-coded or incompletely coded
Solution
Datasets unified in Hadoop to improve health outcomes
• Optical character recognition & natural language processing
• All of the unstructured, text-based data stored on HDP
• Coding errors will be identified much more efficiently
• Partially coded records can also be identified
• Coding efficiency will improve revenue
• Analysis of underlying data will improve health outcomes
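A minimal sketch of the OCR-plus-extraction step described above, assuming the pytesseract wrapper around Tesseract and a claim page rendered to an image. The diagnosis-code regex is a rough illustrative pattern, not the insurer's coding rules.

# OCR + code-extraction sketch (assumes: pip install pytesseract pillow,
# plus a local Tesseract install). File name and regex are illustrative.
import re
import pytesseract
from PIL import Image

# OCR one rendered claim page into plain text.
text = pytesseract.image_to_string(Image.open("claim_page_001.png"))

# Rough pattern for ICD-10-like diagnosis codes, e.g. "E11.9" --
# an illustrative approximation only.
codes = re.findall(r"\b[A-TV-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b", text)
print("candidate diagnosis codes:", sorted(set(codes)))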
Improving Efficiency
Data: Unstructured
Insurance – Health
Large US medical insurer
>$100B in revenue
>100K employees
IH1
Page 120 / 156
Insurance Data Lake to Manage Risk
Problem
Challenges merging new & old data hamper analysis
• Traditional and newer types of data were both growing quickly but were difficult
to combine in the EDW
• “Schema on load” requirements of EDW platform limited ingest of some data
with significant predictive power
• Company missed data-driven ways to serve customers
• Process of separating legitimate from fraudulent claims created “needle-in-a-
haystack” problem
Solution
Common platform for all types of data improves up-sell and reduces fraud
• “Schema on read” Hadoop architecture means that more data sources can be
easily ingested to enrich predictive analytics
• Agents use big data insights to determine the best action for valued customers
and recommend those in real-time
• Claims analysts and underwriters process streaming data to quickly flag fraud
risks and fast-track legitimate claims
Creating Opportunity
Data: Structured,
Clickstream, Server Log
Insurance – Health
Large US medical insurer
>$30B in revenue
>20M members
~35K employees
IH2
Page 121 / 156
Speeding Analysis for Usage-Based Car Insurance
Problem
Risk analysis lagged because of architecture gaps
• Business insight from data analysis was too slow
• Growing volume, velocity and variety of incoming data taxed existing systems &
processes
• ETL process across disparate systems only captured 25% of the dataset, took
5-7 days to complete
Solution
Speed time-to-insight w/ clickstream analytics & faster ETL
• Clickstream analytics
– Moving from a hosted Azure platform to HDP on site will improve performance and
analytical functions (with Apache Hive)
• ETL acceleration
– Process 100% of the data, in three days or less
Creating Opportunity
Data: Clickstream & ETL
Insurance –
Property &
Casualty
Personal auto & other
property-casualty
insurance
>$17B in revenue
~28K employees
IP1
Page 122 / 156
Data Lake for P&C Insurance Claim Analysis
Problem
Structured data scaled, unstructured data analysis did not
• Large P&C insurance provider had systems for analyzing structured data at scale
• Unstructured data from claims notes and social media data had the potential to
add valuable information to claims analysis
• Structured data analysis scaled, but joining this information with hand-written or
social media data did not scale
• Limited data visibility hampered underwriting and claims
Solution
Merge structured & unstructured data for better decisions
• “Schema on read” Hadoop architecture means that more data sources can be
easily ingested (text and social media)
• Previously disparate data sets are joined for greater insight
• Larger data sets fed to front-end business tools provided by Hortonworks
partners: SAS, Tableau and QlikView
Improving Efficiency
Data: Structured, Social
& Unstructured
Insurance –
Property &
Casualty
Major provider of property
casualty, life and mortgage
insurance
>$65B in revenue
>60K employees
Operations in >100
countries
IP2
Page 123 / 156
Maintaining SLAs for Equity Trading Information
Problem
Meeting 12 millisecond SLAs for “ticker plant”
• Daily ingest: 50GB server log data from 10,000 feeds
• Four times daily, this data is pushed into DB2
• Applications query this data 35K times per second
• 70% of queries are for data <1 year old, 30% for >1 year old
• Current architecture can only hold 10 years of trading data
• Growing volume puts performance at risk of missing SLAs
Solution
Meeting SLAs with confidence
• HBase provides super-fast queries within SLA targets (see sketch below)
• ETL offloading to Hadoop allows longer data retention, without jeopardizing fast
response times
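A minimal sketch of the low-latency HBase read path, assuming the HappyBase client over an HBase Thrift server and a hypothetical row-key design (symbol plus reversed timestamp, so the newest ticks sort first). Table, column family and key layout are placeholders, not the customer's schema.

# HBase read/write sketch (assumes: pip install happybase, and a running
# HBase Thrift server). Table, column and row-key layout are hypothetical.
import happybase

MAX_TS = 10**13  # reversed-timestamp trick: newest rows sort first

def tick_row_key(symbol: str, epoch_millis: int) -> bytes:
    # e.g. b"IBM#8300000000000" -- later ticks get smaller suffixes
    return f"{symbol}#{MAX_TS - epoch_millis:013d}".encode()

connection = happybase.Connection("hbase-thrift.example.com")
ticks = connection.table("ticks")

# Write one tick, then scan the most recent ticks for a symbol.
ticks.put(tick_row_key("IBM", 1700000000000), {b"q:price": b"188.02"})
for row_key, cells in ticks.scan(row_prefix=b"IBM#", limit=5):
    print(row_key, cells.get(b"q:price"))
connection.close()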
Improving Efficiency
Data: Server Log & ETL
Investment
Services
Highly trafficked website
providing business and
financial information
~15K employees
IS1
Page 124 / 156
Banking Data Lake for 100s of Use Cases
Problem
Architecture unsuited to capitalize on server log data
• Large investments company generates valuable data assets that are largely unavailable across the organization
• Current EDW solutions are appropriate for some data workloads but too expensive for
others
• Financial log data is difficult to aggregate & analyze at scale
• Short retention hampers price history & performance analysis
• Limited visibility into cost of acquiring customers
Solution
Multi-tenant Hadoop cluster to merge data across groups
• Server log data will be merged with structured data to uncover trends across assets,
traders and customers
• ETL offload will save money for Hadoop-appropriate workloads
• Longer data retention enables price history analysis
• Joining data sets for insight into customer acquisition costs
• Accumulo enforces read permissions on individual data cells
Creating Opportunity
Data: Server Log
Investment Services
Global investments company
> $1.5 trillion assets under
management
> $14B in revenue
~ 50K employees
IS2
Page 125 / 156
Anti-Laundering & Trade Surveillance for Investment Firm
Problem
Lags in back office system limit intraday risk analysis
• 15M transactions and 300K trades every day
• Storage limitations required archiving, limiting data availability
• Trading data not available for risk analysis until end of day, which hampers
intraday risk analytics and creates a time window of unacceptable exposure
Solution
Data lake accelerates time-to-analytics & extends retention
• Shared data repository combines more comprehensive data sets about all firm
activities, improving data transparency
• Operational data available to risk analysts earlier, same day
• Trading risk group will process more position, execution and balance data and
hold that data for five years
• Hadoop enables ingest of data from recent acquisitions despite disparate data
definitions and infrastructures
Creating Opportunity
Data: Structured
Investment
Services
Trading services for
millions of client accounts
>$16B in assets
>4,000 advisors
IS3
Page 126 / 156
Creating Opportunity
Data: Geolocation,
Clickstream, Server Log,
Sensor & Unstructured
Customer Insight from Consumer Electronics Product Usage Data
Problem
Lacked central repository for efficient data storage & analysis
• Rivers of data flow from millions of consumer electronic products
• Company lacked a platform to capture new types of data: geolocation, clickstream,
server log, sensor & unstructured
• Unable to exploit key competitive advantage: unique customer insight from troves of
big data
Solution
Efficient data storage unlocks value in company data
• Hadoop data lake permits view into how customers use products across multiple
types of data
• Lower cost of storage improves the margin for retaining data
• Powerful cluster includes many key ecosystem projects: Hive, HBase, HCatalog, Pig, Flume, Sqoop, Ambari, Oozie, Knox, Falcon, Tez and YARN
Manufacturing
Consumer electronics
>$180B in revenue
>400K employees
MF1
Page 127 / 156
Improving Efficiency
Data: Sensor
Optimizing High-Tech Manufacturing
Problem
Data scarcity for root cause analysis on product defects
• 200 million digital storage devices manufactured yearly
• Devices not passing QA scrapped at the end of the line
• >10K faulty devices returned by customers every month
• Limited data available for root cause analysis means that diagnosing problems is
highly manual (physical inspections)
• Subset of sensor data from QA testing retained 3-12 months
Solution
Data retention doubled, with 10x processing improvement
• Repository of sensor data now holds larger portion of total data
• Dashboard created 10x more quickly than before Hadoop
• Data retained for at least 24 months
• Manufacturing dashboard allows >1,000 employees to search data, with results
returned in less than 1 second
Manufacturing
Digital Storage Devices
>$15B in revenue
>85K employees
MF2
Page 128 / 156
Creating Opportunity
Data: Clickstream &
Server Log
Social Site Speeds Processing, Reduces Cost
Problem
Data growth outpaced existing Greenplum solution
• 20M monthly unique visitors, and growing
• Greenplum storage solution was slow and expensive
• Operations team challenged by data growth
• Analytics team hampered by slow processing speed
Solution
Processing speeds doubled, storage cost decreased
• Operations team saw processing speeds double compared to Greenplum
• Significant cost savings from moving data to HDP
• During this second year of support relationship, plans to move more workloads to
HDP, for better insights at a lower cost
Online Community
Online social network
>$50M in revenue
>300M members
2nd year with Hortonworks
OC1
Page 129 / 156
Creating Opportunity
Data: Clickstream,
Server Log & Social
Powering Professional Network Recommendations
Problem
Lack of a recommendation engine to promote connections
• >13M non-English-speaking members find jobs & connections
• User interactions generate semi-structured data
• Clickstream, server log and social data could feed recommendations
• Company lacked stable platform to store, refine & enrich that raw data
Solution
Hadoop recommendation engine to compete with LinkedIn
• Replaced existing CDH cluster
• New types of data feed a superior recommendation engine that enhances the
value of belonging to the community
• YARN, Tez and Stinger initiative provide near-term functionality and long-term
confidence
Online Community
Online professional network
>$90M in revenue
>13M members
OC2
Page 130 / 156
Better Romantic Matches with Data Science
Problem
Newer types of data unavailable for matchmaking algorithms
• Unable to store clickstream data and user-entered content
• Other types of data only retained for seven days
• Recommendations would help users craft attractive profiles
• High costs to store an ever growing amount of data
• Relational data platform did not fulfill their requirements
Solution
Hadoop cluster for A/B testing, device analysis, text mining
• A/B testing: consolidate email & clickstream from SQL databases (see sketch below)
• Usage patterns across devices, browsers and applications. Understand who uses
their mobile app.
• Mine user-created text (profile language and user-to-user communications) for
recommendation engine
• Longer data retention: find subtle trends with longer time window
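The A/B testing mentioned above ultimately compares conversion rates between two variants. A standard two-proportion z-test, sketched below in stdlib Python with made-up counts, is one common way to decide whether an observed difference is real.

# Two-proportion z-test sketch for A/B conversion rates (stdlib only).
# Counts below are made-up illustrative numbers.
from math import erf, sqrt

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Variant A: 1,200 signups from 48,000 emails; B: 1,350 from 47,500.
p = two_proportion_pvalue(1200, 48000, 1350, 47500)
print(f"p-value: {p:.4f}")  # small p => difference unlikely to be chance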
Creating Opportunity
Data: Server Log & ETL
Online Community
Online dating site
>300 employees
OC3
Page 131 / 156
360° View of Customer for Call Center Sales
Problem
Call center sales reps unable to recommend best product
• 2000+ product lines
• Multiple customer interaction channels (web, Salesforce, face-to-face, phone)
• Poor visibility causes sales reps to miss opportunities and customer satisfaction
suffers
Solution
Improve sales conversions with optimal product recs
• Call center reps will understand every interaction with the customer, to improve
service calls
• Natural language analysis of rep emails to customers identifies best response
language and coaching opportunities
• Recommendation engine predicts the next best product for each customer
Creating Opportunity
Data: Unstructured
Retail
IT solution and equipment
reseller
>$10B in revenue
>6K employees
RT1
Page 132 / 156
360° Customer View for Home Supply Retailer
Problem
Lack of a unified customer record across all channels
• Global distribution online, in home and across 2000+ stores
• Unable to create “golden record” for analytics on customer buying behavior
across all channels
• Data repositories on website traffic, POS transactions and in-home services
existed in isolation of each other
• Limited ability for targeted marketing to specific segments
• Data storage costs increasing
Solution
HDP delivers targeted marketing & data storage savings
• Golden record enables targeted marketing capabilities: customized coupons,
promotions and emails
• Data warehouse offload saved millions in recurring expense
• Customer team continues to find unexpected, unplanned uses for their 360
degree view of customer buying behavior
Creating Opportunity
Data: Clickstream,
Unstructured, Structured
Retail
Major home improvement
retailer
>$74B in revenue
>300K employees
>2,200 stores
RT2
Page 133 / 156
Using In-Store Location Data to Improve Cross-Sell
Problem
Retailer lacks data on how customers move through stores
• Placement of product within department stores affects sales
• Sales data is not specific enough to suggest specific changes
• Online retailers can compare what shoppers view with what they buy, but they
lack this insight in brick and mortar stores
• Result: critical decisions about store layout and inventory are made without data on shopper movement
Solution
Micro-data on shopper location enables in-store analysis similar to website
analysis: locations visited v. purchases
• Apple iBeacon app captures in-store location data for shoppers that have the
app on their iPhones
• Data streams into HDFS on how customers move through their stores, relative
to location of particular products
• Enables real-time promotions to customers w/ smart phones, based on who
they are and where they stand in the store
• Historical data across all shoppers provides insight on store design
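As one illustration of what the streamed location data supports, the sketch below computes per-department dwell time from simplified (shopper, department, timestamp) events; the event shape is a hypothetical reduction of the real iBeacon feed.

# Dwell-time sketch over simplified beacon events (stdlib only).
# Event shape (shopper, department, epoch_seconds) is a hypothetical
# simplification of the real iBeacon feed.
from collections import defaultdict
from itertools import groupby

events = [  # toy data, already ordered by shopper then time
    ("s1", "shoes", 0), ("s1", "shoes", 60), ("s1", "menswear", 300),
    ("s2", "menswear", 10), ("s2", "menswear", 400),
]

dwell = defaultdict(int)  # (shopper, department) -> seconds observed
for shopper, obs in groupby(events, key=lambda e: e[0]):
    obs = list(obs)
    for (_, dept, t0), (_, next_dept, t1) in zip(obs, obs[1:]):
        if dept == next_dept:  # stayed in the same department
            dwell[(shopper, dept)] += t1 - t0

for (shopper, dept), seconds in sorted(dwell.items()):
    print(shopper, dept, f"{seconds}s")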
Creating Opportunity
Data: Sensor &
Geolocation
Retail
Major omni-channel retailer
> $27B in revenue
>175K employees
>800 stores
RT3
Page 134 / 156
Unified Data for Online Recommendation Engine
Problem
5 data sets are fragmented, hampering product recs
• 5 major data sets: inventory data, transactional data, user behavior data, customer
profiles & log data
• Unified view needed, to recommend items to users
• Currently lack analytics dashboard across all types of data
• Storing non-transactional data on EDW is expensive
Solution
Unified data lake for increased sales and lower costs
• Unified 360° view for recommendations of similar products
• Analytics dashboard joins clickstream w/ transactional data
• Summary data stored in HBase, can be queried with web apps
• Offload some data from Teradata EDW, to lower storage costs
• Actively partnering with engineers to improve Hadoop
Creating Opportunity
Data: Structured,
Clickstream, Server Log &
Unstructured
Retail
eCommerce marketplace
>$12B in revenue
>30K employees
RT4
Page 135 / 156
Predicting Car Prices With High Confidence
Problem
Achieving 99.1% confidence in car price estimates
• Goal to provide consumers & dealers reliable car price guides
• Promise: 99.1% confidence that projected price paid will be within $20 of the
average national price paid in a given week
• As network of dealers grew, existing SQL Server data warehouse was expensive
and difficult to scale
Solution
Cost savings & data reliability at scale in a data lake
• Mission-critical price data moved to Hadoop architecture
• Server log data flows into HDP with Flume
• Analysis of this data allows analysts to further improve accuracy of estimates
Creating Opportunity
Data: Server Log & ETL
Retail
Online eCommerce service
for buying and selling cars
~300 employees
RT5
Page 136 / 156
Recommendation Engine Improves Department Store Sales
Problem
Need to create better product recommendations
• Multiple touch points: store, kiosk, web and mobile app
• Wants to promote customized promotions, coupons & recs
• Data was not integrated, making 360-view of customer behaviors impossible
Solution
Recommendations to all channels, based on data lake
• Ingest all raw data from different product lines into HDP
– Real-time data ingestion
– Structured data ingestion
• Transform raw data
– ETL processing with Pig and Hive
– Use Mahout and R to make recommendations (see sketch below)
• Recommendations will be fed to all channels
– HBase serves recommendations to web site, kiosk and mobile app
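The Mahout/R step above is, at heart, item-based collaborative filtering. Below is a tiny pure-Python co-occurrence version of that idea with toy baskets, a conceptual stand-in rather than the retailer's pipeline.

# Tiny item-co-occurrence recommender sketch (stdlib only) -- a toy
# stand-in for the Mahout/R step, not the retailer's actual pipeline.
from collections import defaultdict
from itertools import combinations

baskets = [  # toy transactions
    {"jeans", "belt"}, {"jeans", "sneakers"}, {"belt", "wallet"},
    {"jeans", "belt", "sneakers"},
]

cooc = defaultdict(int)  # (item_a, item_b) -> times bought together
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def recommend(item, k=2):
    scores = {b: n for (a, b), n in cooc.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print("customers who bought jeans also bought:", recommend("jeans"))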
Improving Efficiency
Data: ETL
Retail
Specialty department store
>$19B in revenue
>130K employees
RT6
Page 137 / 156
Faster Reports for Real Estate Agents
Problem
Accelerate reports on movers for real estate agents
• 20 million monthly visitors to family of websites
• Reports on movers not consistently generated quickly enough
• Pressure from newer market entrants
• High data storage costs reduce margins on data
Solution
More data for faster reports at a lower cost
• Improved analytical efficiency speeds report turnaround
• Data storage costs lower than before
• Improved visibility into macro trends in real estate
• Refine, explore and enrich the data better than competitors
Improving Efficiency
Data: Clickstream & ETL
Software
Operator of real estate
websites
~$200M in revenue
>1,000 employees
SW1
Page 138 / 156
Unified View Across Products, for Product Managers
Problem
Data fragmentation across products and verticals
• More than 20 product lines
• Multiple verticals: retail, financial services, healthcare, manufacturing,
communications, utilities & government
• Each product line has a separate data repository
• Unified analysis across product lines was impossible
Solution
Data consolidation for cross-product customer analysis
• Product managers will have unified data for analysis
• Raw data from different products will land in HDP
• Data will then be refined and transformed
• Real time data ingestion with Flume
• Batch data movement with Sqoop
• ETL processing with Pig and Hive
Creating Opportunity
Data: ETL
Software
Data security software,
cloud computing
~$130M in revenue
~1,100 employees
SW2
Page 139 / 156
Data Lake Protects Customers’ Enterprise Data Security
Problem
Batch processing created risk exposure, redundant systems drove costs up
• Customer protects the world’s largest organizations from data security breaches and
backs up their mission-critical data
• Process client data to identify threats and vulnerabilities
• Multiple acquisitions led to a redundant patchwork of big data analysis solutions,
including: Greenplum, Netezza and Vertica
• Six LOBs needed a common, multi-tenant data repository
• Existing batch processing caused 15-minute latency window, with exposure risk
Solution
HDP data lake consolidates infrastructure, reduces cost & speeds response times
• Consolidation into one HDP data lake represents savings of tens of millions of dollars
• Multi-tenancy with YARN permits secure access to multiple LOBs
• Real-time analysis with Apache Storm and interactive query with Apache Hive close the
15-minute risk window from earlier architecture
• Data lake also used for marketing: clickstream analysis & 360-degree customer view
Improving Efficiency
Data: Server Log,
Clickstream & ETL
Software
Global leader in data security,
storage and system
management software
>$6B in revenue
>18K employees
SW3
Page 140 / 156
Launching New Data Analysis Products
Problem
Enterprise customers have no visibility into performance
• Platforms connect 3.4 billion transactions per year
• Currently storing 90TB, growing at 20% YoY
• All divisions retain 36 months, except healthcare network: 7yrs
• Customers have no visibility into their companies’ activity on their commerce
platforms
• Client wants to add analytics services to cross-sell to existing customers and
attract new customers
Solution
HDP data lake enables launch of new information products
• Shorten data processing workloads from days to hours
• Enable ad hoc analytics queries
• Create data analysis products and services for customers of promotion, supply
chain and healthcare networks
• New product: anonymous reports that benchmark customer against competitors in
same industry
Creating Opportunity
Data: ETL
Software
Operator of intelligent
ecommerce networks
>1,400 customers
~5K employees
SW4
Page 141 / 156
Product Managers Speed Product Innovation with Hadoop
Problem
Product managers needed to analyze server logs
• 130K clients drive 780M transactions per day
• Services incorporate streams from core CRM and 3rd-party platforms like Twitter, Facebook and YouTube
• Product managers need to capture and interpret server log data to analyze new
feature adoption & performance
• Unable to process current volume using relational data stores
• Unable to retain enough data because of cost
Solution
HDP gives PMs power, reliability and liberty
• Power: Analysis of more than 30TB per month
• Reliability: Previous system broke every 2 weeks. No longer.
• Liberty: Open source solution prevents vendor lock-in
• HDP increases Product Management storage and analysis without
corresponding increase in IT spend
Creating Opportunity
Data: Server Log
Software
Sales & CRM software,
cloud computing
~$3B in revenue
~10K employees
SW5
Page 142 / 156
eCommerce Platform Uses Data Lake for Insight
Problem
New types of data difficult to store, unavailable for analysis
• Millions of payments processed every day
• Fraudsters sell fake items or extract buyer account info
• Some creditors default, resulting in losses
• Unable to store current volume using relational data stores
• Unable to retain vintage data because of RDBMS storage cost
Solution
HDP data lake accelerates multiple analysis projects
• Platform stores all new types of data: clickstream, social, sensor, geolocation,
server logs and unstructured data
• Detects and prevents theft: fraudsters stealing from members
• Assesses credit risk: server log analysis & machine learning
• Manages offers: aggregates data for advertisers
• User experience: social sentiment analysis on usability
• Site optimization: analyze clickstream for site improvements
Creating Opportunity
Data: Server Log
Software
eCommerce payments
platform
~$6B in revenue
>130M users
~13K employees
SW6
Page 143 / 156
Offloading Clickstream Data from Netezza
Problem
Netezza EDW near capacity, storing costly exhaust data
• Netezza EDW operating near capacity
• Netezza housing exhaust data not required for intended reporting and analytics,
leading to unnecessary expense
• Enterprise IT maintained redundant data stores
• Unable to store clickstream data to enrich consumer intelligence
Solution
Longer storage, lower cost & better consumer intelligence
• Hadoop will recover premium EDW cycles, currently used for transformations and data movement
• Projected costs savings of >$1M by offloading exhaust data
• Analysis of clickstream adds new dimension of customer view
• Improved service efficiency: bill processing & reporting
Improving Efficiency
Data: ETL & Clickstream
Telecom
Major telecom provider
~ $25B in revenue
> 40M customers
TC1
Page 144 / 156
Unified Household View of the Customer
Problem
Acquisitions & data explosion fragment view of customer
• Recent acquisitions and proliferation of types of data caused fragmented view of
customers
• Data exists across multiple applications & data stores
• Semi-structured data: social, sensors & networked devices
• Difficult to integrate structured, semi-structured & unstructured data sets from so
many distinct sources
Solution
HDP data lake delivers 360° unified household View
• Stable environment for exploring and enriching the data
• Store all of the data and retain it for longer
• Parse on demand: no need to pre-parse data before loading
• Analysis on demand: allows analysts to explore raw data and find unexpected
truths in the data
Creating Opportunity
Data: ETL, Social,
Sensor & Clickstream
Telecom
Major telecom provider,
offering data networks &
services
> $100B in revenue
> 200K employees
TC2
Page 145 / 156
Call Record Analysis for Improved Cell Service
Problem
System receives millions of call detail records per second
• System enables proactive management of phone call quality
• Call detail records (CDRs) are the raw data used for analysis
• Millions of CDRs stream in every second
• Storage is expensive & ingest rates are increasing 20% YoY
• 24-hour data retention not sufficient to discover long-term trends
Solution
Longer storage & rich analysis improve customer service
• HDP’s 10:1 compression allows affordable 6-month retention (see sketch below)
• Improved forensics on instances of poor call quality drive:
– Informed decisions on expansion of transmission infrastructure
– Predictive analytics on when to repair/replace equipment
• Access to more data helps service reps solve customer issues in near real-time
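The retention claim above can be sanity-checked with back-of-the-envelope arithmetic. The figures below (ingest rate, record size) are assumed for illustration; only the 10:1 compression ratio is taken from the slide.

# Back-of-the-envelope retention math (assumed figures, for illustration).
CDRS_PER_SECOND = 2_000_000   # "millions of CDRs every second" (assumed 2M)
BYTES_PER_CDR = 200           # assumed average record size
COMPRESSION = 10              # 10:1 compression cited on the slide
SECONDS_PER_DAY = 86_400

raw_tb_per_day = CDRS_PER_SECOND * BYTES_PER_CDR * SECONDS_PER_DAY / 1e12
compressed_tb_per_day = raw_tb_per_day / COMPRESSION
print(f"raw ingest:        {raw_tb_per_day:,.1f} TB/day")
print(f"after compression: {compressed_tb_per_day:,.1f} TB/day")
print(f"6-month footprint: {compressed_tb_per_day * 182:,.0f} TB")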
Creating Opportunity
Data: Sensor
Telecom
Major telecom provider,
offering data networks &
services
> $100B in revenue
> 200K employees
TC3
Page 146 / 156
ETL: 100x the Data, 12x Longer, $3M Saved
Problem
Changing business model required new data architecture
• Started in the 1990s as a neutral intermediary for telco networks
• Network management market is mature
• CEO challenged company to build business for data analysis and information
services, related to telecom data
• Netezza data capacity limited to 20TB
• Only stored 1% of total dataset, retained for only 60 days
Solution
More data, stored longer, with $3 million in cost savings
• Avoided $3M annual expense, compared to Netezza
• Now storing 100% of data, retained for two years
• Larger data set supports new, accurate information products
• Improved access to data for more employees drives new innovation across the
enterprise
Creating Opportunity
Data: ETL
Telecom
Telco information and
analytics vendor
$800M in revenue
~2,500 employees
TC4
Page 147 / 156
Searchable ETL for CDRs & Customer Data
Problem
Data storage costs limit the amount and types of data available for analysis
• Teradata and Vertica used for data storage, ideal for certain data workloads, but
unsuited for less structured types of data
• Limited retention of call detail records (CDRs)
• Limited analysis across call logs, CRM records & customer acquisition models
Solution
Data lake: ETL, data exploration & “next product to buy” (NPTB) recommendations
• Partners Teradata, HP and Impetus helped craft a solution
• CDRs now retained for longer, improving visibility & analysis
• Customer retention data can be correlated to service quality
• Plan to integrate search for real-time NPTB recommendations
• Improved customer acquisition and retention
Creating Opportunity
Data: Structured, Server Log
& Geo-location
Telecom
Telco vendor specializing in
VOIP
> $800M in revenue
> 2M subscribers
~ 1,000 employees
TC5
Page 148 / 156
Better Service to Premium Customers, for Less
Problem
Inability to identify base stations serving premium customers
• CRM system and network logs were in isolated data silos
• Company unable to analyze base station usage by premium customers, to prioritize
investments
• Info gap prevented optimal ROI on infrastructure investments
Solution
HDP joins structured CRM & unstructured network data at scale
• Partnered with Datameer and HP to deliver a unified solution
• Joins network data on utilization of base stations with CRM data on the value of
customers using those stations most often
• Optimizes service to the most valuable customers
• Efficient resource allocation reduces overall cost to maintain network infrastructure
Improving Efficiency
Data: Structured &
Server Log
Telecom
Major European telco
> $800M in revenue
> 300M customers
> 100K employees
TC6
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 

  • 9. Uncaptured & Unanalyzed Data – A Missed Opportunity. Every organization has data lying around: not yet captured, poorly captured, or captured but never analyzed. That data may contain hidden gems for improving decision making; leaving it untouched is a missed opportunity.
  • 10. Big Data Processing Technologies Are NOT Big Data. It is a common misconception that adopting Hadoop, Spark, etc. means adopting Big Data. It does not. Big Data is the massive amount of data you have collected, or have the opportunity to collect, but are unable to collect and process due to computing or cost limitations. Adopting Hadoop, Spark, or any other Big Data technology without a strong data collection and analysis strategy will not deliver the benefits you are after.
  • 11. How Big IS Big? A common question: how big should my data be for it to be considered Big Data? The answer is subjective, depending on the organization, data, analytical processes, and outputs you are dealing with. But ask yourself two questions. First: is your current data architecture/infrastructure able to collect the data and produce the outputs you require, in a timely manner? Second: do you plan to collect more and more data – terabytes and beyond – with the goal of analyzing it very rapidly, while archiving the raw data for a long period of time? If your architecture keeps up and you have no such growth plans, you are likely not dealing with a Big Data problem. If your architecture cannot keep up, or you do have such growth plans, you are likely dealing with a Big Data problem – or simply an optimization problem.
  • 12. Data – The New Oil
  • 13. Basic Concepts of Petroleum Mining. Petroleum reserves exist in wells and shale. Mining equipment extracts petroleum from wells. Petroleum pipelines transport petroleum to silos and refineries. Silos store petroleum before it is processed. Refineries refine petroleum to create petroleum-based products for consumers. Petroleum engineers design, construct, and maintain the petroleum mining equipment, pipelines, silos and refineries. Petroleum scientists research petroleum to create new products and applications using its components.
  • 14. Data Mining & Analytics. Data exists in the environment. Sensors and data collection software extract data from the environment. Ingestion/ETL data pipelines bring raw data into a central data repository. Data repositories/databases store data for analytics purposes. Data refineries/processing software process data to extract analytical results for data consumers. Data engineers design, construct, and maintain the data mining, pipeline, repository and processing infrastructure. Data scientists research data to create new products and applications using the analytical results from it. We are your data engineers!
  • 15. Handling Big Data: Data Science vs Data Engineering. ● Data Science / Data Analysis – extract value from data – descriptive/predictive/prescriptive analytics – unstructured data analysis – domain expertise – skills: statistics, R, Python, Spark ML, Weka, Scala, etc. ● Data Engineering – infrastructure, technologies and expertise to handle the volume, velocity and variety of data – data pipelining, ingestion, scheduling and pre-preparation – job/query optimization, parallel processing, data processing automation – dashboards and data applications – skills: Hadoop, YARN, NiFi, NoSQL, Python, MapReduce, Java, etc.
  • 16. Profile of a Data Scientist. Math & Statistics: machine learning; statistical modeling; experimental design; Bayesian inference; supervised learning (decision trees, random forest, logistic regression); unsupervised learning (clustering, dimensionality reduction); optimization (gradient descent and variants). Domain Knowledge & Soft Skills: passionate about the business; curious about data; influence without authority; hacker mindset; problem solver; strategic, proactive, creative, innovative and collaborative. Programming & Database: computer science fundamentals; a scripting language, e.g. Python; a statistical computing package, e.g. R; databases, SQL and NoSQL; relational algebra; parallel databases and parallel query processing; MapReduce concepts; Hadoop and Hive/Pig; custom reducers; experience with XaaS like AWS. Communication & Visualization: able to engage with senior management; storytelling skills; visual art & design; R packages like ggplot or lattice; knowledge of visualization tools, e.g. Flare, D3.js, Tableau.
  • 17. Profile of a Data Engineer. The same four skill quadrants apply as for the data scientist above: Math & Statistics, Domain Knowledge & Soft Skills, Programming & Database, and Communication & Visualization.
  • 19. Computing Before the 'Big Data' Era. Applications only selectively collected the data necessary for their core functionality, discarding the rest. Due to technological limitations, applications mostly stored recent data and processed it to generate relatively simple reports; old data was regularly purged for performance and cost reasons. Complex, intensive processing over massive amounts of data required big, expensive mainframes or supercomputers. Reports and analytical outputs were limited to low-frequency processing (e.g. daily, monthly) due to computing limitations.
  • 20. Google – Pioneer of Big Data. All public websites on the internet → Process & Index → Search service (a generalization / high-level view, not the actual architecture). Google spiders/Googlebots crawl the internet, capturing every web page they can and bringing the data into Google's internal storage and processing infrastructure. Google's backend processing engines regularly process and update the website index, rank websites using the proprietary Google PageRank algorithm, and then provide a fast, searchable index of the whole internet to end users.
  • 21. Google's Solution (Pre-2003): GFS + MapReduce. Web page data is collected and stored in a distributed datastore (the Google File System) across lots of commodity hardware. The MapReduce framework analyzes, transforms and ranks web pages en masse in a periodic manner, before they are indexed by the Google search engine cluster to serve searches.
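To make the map/reduce pattern concrete, here is a minimal, self-contained Python sketch of a word count over a couple of stand-in "pages". It is illustrative only – the input data and function names are made up, and this is not Google's or Hadoop's actual code; in a real cluster the map, shuffle/sort and reduce phases run distributed across many machines, with the framework handling the grouping step.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    # Emit an intermediate (word, 1) pair for every word on the page.
    for word in text.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Sum all partial counts emitted for one word.
    return (word, sum(counts))

pages = {"page1": "big data big insights", "page2": "big data pipelines"}

# Map: run the mapper over every page.
pairs = [kv for doc, text in pages.items() for kv in map_phase(doc, text)]
# Shuffle/sort: group intermediate pairs by key (done by the framework).
pairs.sort(key=itemgetter(0))
groups = groupby(pairs, key=itemgetter(0))
# Reduce: aggregate each group into a final (word, total) pair.
result = [reduce_phase(word, (c for _, c in grp)) for word, grp in groups]
print(result)  # [('big', 3), ('data', 2), ('insights', 1), ('pipelines', 1)]
```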
  • 22. Nutch Project (2002) – An Attempt to Create an Open Source Web Search Engine Infrastructure. The Nutch project was attempting to build a full-scale web search engine, from crawler to indexing. Back then, however, it only had a web crawler, and had yet to solve the storage and processing problem for the data it gathered.
  • 23. Google Released the GFS and MapReduce Papers – 2003-2004. Google released the GFS (late 2003) and MapReduce (late 2004) papers to the community, describing the architecture it used to store and manage distributed data at Google, and how that data was processed in a distributed manner.
  • 24. Nutch Distributed Filesystem + Nutch MapReduce (2004-2005). With the goal of creating a search engine, the Nutch project picked up both the GFS and MapReduce papers and developed its own implementations of the two technologies: the Nutch Distributed File System (NDFS) and Nutch MapReduce.
  • 25. Hadoop Project Branched Out from the Nutch Project. In 2006, the Hadoop project split out from Nutch to provide a specialized, affordable solution for storing and processing massive amounts of data on commodity hardware (HDFS + MapReduce). The open source nature of Hadoop helped spark the move towards Big Data processing across the whole industry by making massive data processing affordable. Now everybody can compute over massive amounts of data!
  • 26. The MapReduce Paper Also Inspired Other Technologies Following Its Architecture. Some existing database technologies, such as MongoDB and some PostgreSQL flavors, adopt MapReduce internally for computing over distributed data in their clusters. Various programming languages also have libraries that implement MapReduce as a distributed computing algorithm, not necessarily on Hadoop.
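As a hedged illustration of that point, the same pattern can be written with nothing but the Python standard library, parallelizing the map step across local CPU cores instead of a Hadoop cluster; the input chunks below are made up.

```python
from functools import reduce
from multiprocessing import Pool

def mapper(chunk):
    # Map step: count words within one chunk of the input.
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge(a, b):
    # Reduce step: merge two partial count dictionaries.
    for word, n in b.items():
        a[word] = a.get(word, 0) + n
    return a

if __name__ == "__main__":
    chunks = ["big data big insights", "big data pipelines"]  # stand-in input
    with Pool() as pool:
        partials = pool.map(mapper, chunks)  # map phase, in parallel
    totals = reduce(merge, partials, {})     # reduce phase
    print(totals)  # {'big': 3, 'data': 2, 'insights': 1, 'pipelines': 1}
```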
  • 27. Impact of Big Data Adoption on Data Analytics Practice
  • 28. Stages of Organizational Data Growth (source: Teradata)
  • 29. Six Sigma – Data-Driven Decision Making, supported by data from the data analytics practice
  • 30. Business Intelligence vs Business Analytics.
What does it do? Business Intelligence reports on what happened in the past or what is happening now, in current time. Business Analytics investigates why it happened and predicts what may happen in the future.
How is it achieved? BI: basic querying and reporting; OLAP cubes, slice and dice, drill-down; interactive display options – dashboards, scorecards, charts, graphs, alerts. BA: applying statistical and mathematical techniques; identifying relationships between key data variables; revealing hidden patterns in data.
What does your business gain? BI: dashboards with "how are we doing" information; standard reports and preset KPIs; alert mechanisms when something goes wrong. BA: answers to "what do we do next?"; proactive, planned solutions for unknown circumstances; the ability to adapt and respond to changes and challenges.
  • 31. Components of Data Analytics. Four stages, each answering a different question: Descriptive – what happened? Diagnostic – why did it happen? Predictive – what will happen? Prescriptive – what should we do next? Supporting techniques include OLAP, statistics, data mining, machine learning, deep learning, artificial intelligence and knowledge bases. Data requirements increase as you move from descriptive towards prescriptive.
  • 32. Cost of Data Analytics. ● As we go up the chain from descriptive to prescriptive, more data must be analyzed to compute the outputs, and data scale and computing power needs increase accordingly. ● Historically, only those who could afford supercomputers or large mainframes could apply advanced predictive and prescriptive analytics to their data assets; for everyone else, computation took so long that it was impractical for business use. ● Big Data adoption removed several barriers: it became easy for programmers to write computation algorithms that span hundreds of commodity machines; existing algorithms that used to run only on a single computer were ported over to distributed computing; cloud-based architectures allow usage-based costing with minimal to no upfront cost; Big Data on open source technologies removes upfront software cost for the technically savvy. Advanced analytics became affordable for businesses.
  • 34. Simpler Data Collection = More Data Collection = Bigger Data. ● Traditional flow (ETL): an ETL flow must be developed to extract and transform raw data before loading it into the central data management platform. The inherent cost of designing and developing an ETL flow and data model prevents data from being collected early, and enhancing the data model with new sources means changing ETL jobs, which can become unmaintainable in the long run. ● Flow in Big Data practice (ELT): instead of waiting for an ETL flow and destination data model to be developed, raw data is brought immediately into the central data management platform through simpler ingestion jobs – the data collection barrier is removed. Analytics can be done directly on the raw data, or a transformation job can be executed post-ingestion to prepare the data model. ELT does come at the cost of requiring more data storage, but hardware is usually cheaper than manpower.
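A minimal sketch of the two flows, assuming pandas is available; the file names, directories and columns are hypothetical stand-ins.

```python
import os
import shutil
import pandas as pd

os.makedirs("warehouse", exist_ok=True)
os.makedirs("datalake/raw", exist_ok=True)

# Stand-in extract from a source system.
pd.DataFrame({"amount": [100.0, None, 250.0],
              "fx_rate": [4.7, 4.7, 4.7]}).to_csv("sales_raw.csv", index=False)

# ETL: transform first -- only the modeled, cleaned data reaches the platform.
raw = pd.read_csv("sales_raw.csv")
model = (raw.dropna(subset=["amount"])
            .assign(amount_myr=lambda d: d["amount"] * d["fx_rate"]))
model.to_csv("warehouse/sales_model.csv", index=False)

# ELT: land the raw file untouched first; transform later in the platform.
shutil.copy("sales_raw.csv", "datalake/raw/sales_raw.csv")
# ...a post-ingestion job can now build (and rebuild) models from the
# preserved raw copy at any time, at the cost of extra storage.
```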
  • 35. ELT vs ETL.
ELT advantages: no need for a separate transformation engine – the work is done by the target system itself; data transformation and loading happen in parallel, so less time and fewer resources are spent; ELT works with high-end data engines such as Hadoop clusters, cloud platforms or data appliances, which gives it additional performance and security; the processing capability of the data warehousing infrastructure reduces the time data spends in transit and makes the system more cost-effective.
ELT disadvantages: the specifics of ELT development vary by platform – Hadoop clusters, for example, work by breaking a problem into smaller chunks and distributing those chunks across a large number of machines, and some problems split easily while others are much harder; developers need to be aware of the nature of the system they use to perform transformations, since some systems can handle nearly any transformation while others lack the resources, requiring careful planning and design.
ETL advantages: a single-view interface to integrate heterogeneous data; the ability to join data both at the source and at the integration server, with the option to apply any business rule from within a single interface; a common data infrastructure for data movement and data quality work; a parallel processing engine providing exceptional performance and scalability.
ETL disadvantages: migration from server to enterprise edition can require vast time and resources due to the innumerable architectural differences between editions; no automated error handling or recovery mechanism; expensive as a solution for small or midsized companies.
  • 36. Evolution of Data Management Architectures
  • 37. File Based. The most basic data management architecture: the application reads and writes data files on disk, and reports are generated by reading data back from those files.
  • 38. Database. The most common architecture for applications: separate application and database services/nodes. The database abstracts the complexity and optimizes the performance of managing file-based storage; the application inserts the data it gathers and queries data to create outputs.
  • 39. Separated OLTP / OLAP Databases. The natural path for reducing database workload: separate the infrastructure for operational application use from the infrastructure for analytical reporting use. A replica database syncs with the source database via ETL/sync jobs, and analytical report queries are executed on the replica, not the source.
  • 40. Data Warehouse. When analytical reports must be generated from data coming from many sources, a central data warehouse provides the infrastructure for cross-system analytical queries. Data is moved into the warehouse through an extract-transform-load process that normalizes datasets and makes cross-system data joins possible. Data marts are usually created on top, containing more human-understandable, domain-specific data structures that make it easy for non-technical users to analyze data in the warehouse.
  • 41. Modern Data Architecture (Data Warehouse + Data Lake). Organizations with more advanced analytical practices want to collect not just data from operational databases, but also datasets in various formats from the other sources and applications around them. A data lake provides a simpler architecture for gathering these datasets for future analytical use, along with a highly scalable platform for computing over massive data. A data lake is usually used together with an existing data warehouse, to leverage the warehouse's strength in structured data processing.
  • 43. Lambda Architecture. Incoming data streams are ingested into two paths: a batch layer, which stores all data and periodically precomputes aggregated views, and a speed layer, which preprocesses records off a message queue and maintains real-time aggregated views. A serving layer answers queries from the combination of the batch and real-time aggregated views.
  • 44. Batch Processing. Characteristics: scheduled or interactive processing; bulk activity; works over historical data or a subset of it; processing takes from seconds to hours; primarily analytical and reporting workloads; results are used by automated systems or users. Strengths: able to access and compute over all data for analysis; relatively simple to implement; a familiar setup, as most systems are batch. Weaknesses: not suitable for frequent queries if the data is very large; requires data flow optimization to precompute results.
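A minimal PySpark batch-job sketch of this pattern (assuming a Spark installation; the input path and column names are hypothetical): read a bulk of historical data, aggregate it, and write a report on a schedule.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-batch-report").getOrCreate()

# Bulk-read historical data from the data lake.
events = spark.read.json("hdfs:///datalake/raw/events/")

# Aggregate everything in one scheduled pass.
report = (events.groupBy("customer_id")
                .agg(F.count(F.lit(1)).alias("events"),
                     F.sum("amount").alias("total_amount")))

report.write.mode("overwrite").parquet("hdfs:///datalake/reports/daily/")
spark.stop()
```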
  • 45. Real-Time Processing. Characteristics: data is processed as it arrives; deals primarily with the most recent data; processing a record takes milliseconds to tens of seconds; supports complex event processing and notifications; results are used by automated systems. Strengths: lower load over time, because data is processed as it comes in throughout the day rather than in bulk operations; analyzed reports update immediately throughout the day, allowing faster decision making. Weaknesses: more difficult to develop, as it requires writing a real-time data pipeline application.
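A minimal PySpark Structured Streaming sketch of the same aggregation done continuously (assuming Spark with the Kafka connector package; the broker address and topic are hypothetical): records are aggregated as they arrive rather than in a nightly bulk job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("speed-layer").getOrCreate()

# Consume records continuously from a message queue.
stream = (spark.readStream.format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "events")
               .load())

# Maintain a running count per key -- a real-time aggregated view.
counts = (stream.selectExpr("CAST(key AS STRING) AS k")
                .groupBy("k")
                .count())

query = (counts.writeStream
               .outputMode("update")   # emit only keys updated in each batch
               .format("console")
               .start())
query.awaitTermination()
```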
  • 46. Data Collection: An important component of data analytics
  • 47. Big Data Titans Are Data Collection Titans. When it comes to data collection, these companies collect whatever they can from every point in their business operations.
  • 48. Data Analytics Is Dependent on Input Data. The outputs you can produce are bounded by the inputs you collect and store.
  • 49. Various Sources of Data Collection: clickstream, logs, sensors, web / social media, RDBMS – from applications, devices, the internet, mobile, and databases.
  • 50. Two Strategies of Data Collection. ● Business question driven: data is collected based on business needs; clear scope, goals and deliverables; manageable size; but a long turnaround before data can be turned into actionable insights – you have to wait for data growth, and advanced analytics is not possible until the data has grown large enough. ● Collect first, analyze later: data is collected as it is discovered or required; builds data assets before doing data analytics; requires an initial investment in data storage and risks collecting useless data; business questions are asked against the available data assets – rich data assets allow advanced analytics with a shorter turnaround time.
  • 52. Data Is Everywhere. The Internet of Things is about a connected world, where everything is connected to the internet – everything is an input data source, and everything is an output display. Sensors are everywhere: GPS, temperature, humidity, luminosity, audio, video, and more. IoT brings massive amounts of data – Big Data.
  • 53. Typical IoT Application Architecture. IoT sensors collect data and send it to the application backend in the cloud. An army of servers works together to store and process the data for the IoT application. Analytical results from the collected data are then provided to customers and users, delivering value.
  • 54. Tools and Technologies for Big Data
  • 55. Ecosystem: data collection, data pipelining, data processing, data storage, data serving, data visualization.
  • 56. Data Collection. The starting point of accumulating data assets: measuring or capturing environment variables or state as digitized data. Tools and equipment include, but are not limited to: any programming language (write out application state as logs); web scrapers (Scrapy/Portia, FMiner, Outwit, Mozenda, Capterra); sensor equipment (Raspberry Pi, Arduino, various sensor circuits, SCADA); RDBMS extractors (Sqoop, various ETL tools, custom scripts); mobile devices (modern smartphones have a rich array of sensors). A minimal collector is sketched below.
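The simplest possible collector is a script that samples a reading and appends it as one JSON log line per record; read_temperature() below is a hypothetical stand-in for a real sensor driver.

```python
import json
import random
import time
from datetime import datetime, timezone

def read_temperature():
    # Hypothetical stand-in for a real sensor driver (Raspberry Pi, SCADA...).
    return round(25 + random.uniform(-2, 2), 2)

with open("sensor_log.jsonl", "a") as log:
    for _ in range(5):
        record = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "sensor": "temp-01",
            "celsius": read_temperature(),
        }
        log.write(json.dumps(record) + "\n")  # one record per line
        time.sleep(1)
```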
  • 57. Data Pipelining. Moves data from sources to repositories, coordinating and scheduling data extraction and pre-processing workflows while the data is in flight to the repositories. Tools include, but are not limited to: workflow programming libraries (Airflow, Luigi, Oozie, etc.); traditional ETL tools (Talend, Pentaho, Oracle Data Integration, etc.); stream data pipeline tools (Apache NiFi, NodeRED, StreamSets, Storm, Kafka Connect, etc.). A scheduling sketch follows.
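A hedged Airflow sketch of the coordinate-and-schedule role (operator import paths and parameters vary across Airflow versions; the DAG id and callables here are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw files from the source system")

def land():
    print("write raw files into the data lake")

with DAG(dag_id="nightly_ingestion",
         start_date=datetime(2017, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_land = PythonOperator(task_id="land", python_callable=land)
    t_extract >> t_land  # run extraction before landing
```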
  • 58. Data Storage. Stores and archives data for short- and long-term use, and works together with the processing infrastructure to extract insights from data by providing optimized data structures. Tools include, but are not limited to: software-defined distributed storage (HDFS, GlusterFS, Ceph, ZFS, etc.); databases (PostgreSQL, Oracle, MSSQL, etc.); NoSQL datastores (MongoDB, Elasticsearch, Solr, Neo4j, HBase, Redis, etc.); message queues (Kafka, RabbitMQ, Redis, etc.). See the landing sketch below.
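A hedged sketch of landing a file into HDFS with the community `hdfs` (WebHDFS) client library; the NameNode URL, user and paths are hypothetical:

```python
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="etl")

# Land a raw record into distributed storage for long-term retention.
with client.write("/datalake/raw/sensor_log.jsonl", encoding="utf-8",
                  overwrite=True) as writer:
    writer.write('{"ts": "2017-01-01T00:00:00Z", "celsius": 25.1}\n')

print(client.list("/datalake/raw"))  # verify the file landed
```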
  • 59. Data Processing. Processes and computes data to extract value and insights, in batch or real time, ideally in a distributed manner, and provides algorithms for complex computations. Tools include, but are not limited to: any programming language, especially R, Scala and Python; distributed batch processing engines (MapReduce, Tez, Hive, Pig, Spark); distributed stream processing engines (Storm, Celery, StreamParse); traditional ETL tools (Talend, Pentaho, Oracle Data Integration).
  • 60. Data Serving. Serves processed data for high-performance analytical queries, using highly optimized data structures for purpose-specific queries. Tools include, but are not limited to: high-performance OLAP (Druid, Kylin); graph data stores (Neo4j, ArangoDB); search engines (Elasticsearch, Solr); time series databases (Graphite, InfluxDB, OpenTSDB, Prometheus). A query sketch follows.
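A hedged sketch of serving a purpose-specific query from a search engine with the elasticsearch-py client (the host, index name and field are hypothetical, and the request style varies by client version; the query shown is the standard match query DSL):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Serve an analytical query from an index optimized for search.
response = es.search(index="weblogs", body={
    "query": {"match": {"status": 500}},
    "size": 10,
})
for hit in response["hits"]["hits"]:
    print(hit["_source"])
```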
  • 61. Data Visualization. Displays data summaries and reports as visual diagrams and charts, and supports visual data discovery and exploration. Tools include: traditional BI/reporting tools (Pentaho, Jasper, SAS, SpagoBI, MicroStrategy, etc.); real-time dashboarding (Grafana, Kibana); visualization libraries (D3.js, DC.js, Shiny, Bokeh, etc.); visualization platforms (Tableau, Redash, Superset). A small charting sketch follows.
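A small charting sketch with Bokeh, one of the libraries listed above; the data is made up:

```python
from bokeh.plotting import figure, output_file, show

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
events = [120, 340, 290, 410, 380]

output_file("daily_events.html")  # renders to a standalone HTML page
p = figure(x_range=days, title="Events per day",
           x_axis_label="day", y_axis_label="events")
p.vbar(x=days, top=events, width=0.8)
show(p)  # opens the chart in a browser
```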
  • 62. Understanding Open Source License & Consumption Model
  • 63. Modern Big Data Technologies Are Driven by the Open Source Community
  • 64. Open Source Software Definition. Software licensed under a license that guarantees the following rights: – Free Redistribution ● The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale. – Source Code ● The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed. – Derived Works ● The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software. – Integrity of The Author's Source Code ● The license may restrict source-code from being distributed in modified form only if the license allows the distribution of "patch files" with the source code for the purpose of modifying the program at build time. The license must explicitly permit distribution of software built from modified source code. The license may require derived works to carry a different name or version number from the original software.
  • 65. www.abyres.net (c) 2017 Abyres Enterprise Technologies Sdn Bhd Page 65 / 156 OpenSource Software Definition – No Discrimination Against Persons or Groups ● The license must not discriminate against any person or group of persons. – No Discrimination Against Fields of Endeavor ● The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research. – Distribution of License ● The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties. – License Must Not Be Specific to a Product ● The rights attached to the program must not depend on the program's being part of a particular software distribution. If the program is extracted from that distribution and used or distributed within the terms of the program's license, all parties to whom the program is redistributed should have the same rights as those that are granted in conjunction with the original software distribution. – License Must Not Restrict Other Software ● The license must not place restrictions on other software that is distributed along with the licensed software. For example, the license must not insist that all other programs distributed on the same medium must be open-source software. – License Must Be Technology-Neutral ● No provision of the license may be predicated on any individual technology or style of interface.
  • 66. Open Source Does Not Mean No Copyright. Open source software is copyrighted, not public domain. The author retains the copyright and intellectual property, but chooses to grant licensees additional rights that are normally not granted under a proprietary license. Any user of the software automatically becomes a licensee the moment they acquire a copy. Open source authors usually reuse legal license documents that already exist in the open source community as the license for their software. Should you not comply with the terms and conditions in the license document, the author has the right to enforce the license.
  • 67. Types of Open Source Licenses. Permissive: most flexible; derivative works are not required to be open source or to use the same license; e.g. MIT, BSD. Weak copyleft: some parts of derivative works are required to use the same license; usually used for libraries – modifications to the library itself must be released under the same license, but projects importing the library need not be; e.g. LGPL. Strong copyleft: strict enforcement of the same license for any derivative work; all projects importing libraries licensed this way must also be released under the same license as the original work; e.g. GPL, AGPL.
  • 68. Consumption Model. Common misconceptions about open source in the enterprise: that it is free; that it is not licensed; that it comes without support; that it is unstable and changes too fast. Open source is a software development model, but not exactly a software consumption model – the consumption model is more or less similar to other software. Software can be free, but human time is not. Software is developed in public, possibly with community contributions and many frequent improvements. An open source software distribution company takes a snapshot of the codebase, stabilizes and integrates it, creates support, training and warranty models, and productizes the software. Enterprise customers buy the productized software and receive support, training and warranty from the distribution company, and services from SIs. An ecosystem of system integrators, ISVs and trainers provides professional services and added value around the productized software.
  • 69. Upstream Software vs Downstream Enterprise Product. Upstream: rapidly changing; latest and greatest features; can be unstable; no warranty, no support, or minimally supported; most of the time free of capital cost. Enterprise: fewer changes over a short period of time; tried and tested features; generally more stable; comes with warranty, support SLAs, training and certifications; charged for support and warranty subscriptions and professional services.
  • 70. Do I Have to Use an Enterprise Product? Do I have to use the enterprise edition of an open source software? Are you going to use it as a hobby or professionally? As a hobby – not necessary. Professionally, is it for R&D or for production? For R&D – not necessary. For production: do you have regulations against running software without warranty or internal expertise? If yes – required. If no: do you have the budget? If yes – recommended; if no – not necessary.
  • 71. A Framework For Organizational Data Journey
  • 72. Big Data Transformation Journey
  • 73. Stage 1: Data Discovery on Active Archive. ● Initial starting infrastructure for proof of value: 2 masters and 3 workers for batch/interactive processing, plus 1 node for stream processing. ● Select several datasets and ingest both current data and the historical archive, along with related tables, making them available for analyzing patterns over a long historical context – e.g. Touch 'n Go transactions. ● Familiarize the team with the technologies and processes involved in Big Data. ● Create reports/dashboards detailing discoveries from analysis of the historical archive. ● Time frame: 6-12 months.
  • 74. Stage 2: Data Lake. ● Medium-scale cluster for a central data lake: 3 masters and 10-15 workers for batch/interactive processing, plus 2 nodes for stream processing. ● On-board more datasets from various internal sources into the data lake to get a 360-degree view of the organization – e.g. CRM, ERP, website logs, device logs. ● Develop and launch reports and dashboards on cross-dataset relationships and patterns – e.g. a 360-degree view of the customer. ● Time frame: 6-24 months.
  • 75. Stage 3: Advanced Analytics. ● Large-scale cluster for complex computation: 4 masters and 20+ workers for batch/interactive processing, plus 4+ nodes for stream processing. ● On-board external data sources for enrichment against internal datasets – e.g. social media, web scrapers, IoT sensors. ● Pursue aggressive data collection and data mining as a strategic direction and asset. ● Identify repetitive patterns, create models to predict them, and leverage those models in AI-powered applications (a minimal modeling sketch follows). ● Time frame: 12-24 months.
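A minimal modeling sketch of the "identify patterns, predict them" step using scikit-learn; the features and labels below are made-up stand-ins, not real data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical features (e.g. usage metrics) and a binary outcome to predict.
X = [[1, 20], [2, 18], [8, 3], [9, 2], [7, 4], [1, 22], [8, 1], [2, 19]]
y = [0, 0, 1, 1, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print(model.predict(X_test))        # predictions for unseen records
print(model.score(X_test, y_test))  # simple accuracy check
```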
  • 76. Stage 4: Continuous Improvement. Continuous data-driven transformation and innovation.
  • 78. The Data Journey to a Golden Batch
  • 79. Case Study: Merck's Journey. Improving life sciences manufacturing yields presents a complex data discovery challenge. Vaccine manufacturing requires precise control of complex fermentation processes. Two batches of a vaccine, produced using an identical manufacturing process, can exhibit significant yield variances. Batches that fail quality standards can cost $1 million each. Data for one vaccine was stored across 16 different systems, and high storage costs limited the length of data retention.
  • 80. Merck's Journey: The Golden Batch. Journey steps (renovate to innovate): sensor data storage, scientific search, vaccine yield optimization, epidemiology. Results: combined 10 years of data on one vaccine – 1 billion records; 5.5 million batch comparisons; a first-year yield boost of 40K more doses; $10M profit impact; McKinsey: 50% yield increase.
  • 81. The Data Journey to Safe Roads
  • 82. Case Study: Progressive's Journey. Progressive wanted to ingest IoT data to predict risk for its usage-based insurance product. Progressive Snapshot offers usage-based insurance through an in-car sensor that transmits IoT driving data. Sensors collect up to six months of data from drivers, and the data is archived for years per regulatory requirements. Progressive's existing systems were not scaling efficiently: it took 5-7 days to transform only 25% of the available UBI data.
  • 83. Progressive's Journey: Rewarding Safer Drivers and Improving Traffic Safety. Journey steps (renovate to innovate): sensor data ingest, web log analysis, individual driving histories, claims notes mining, usage-based insurance (UBI), online ad placement, safe roads. Results: Snapshot plug-in devices capture driving detail; Progressive stores more than 10 billion miles driven; through a web app, customers can review their own driving detail and improve their safety; Snapshot and usage-based insurance drove $2.6 billion in 2014 Progressive premiums.
  • 84. The Data Journey to Better Health
  • 85. Case Study: Mercy's Journey. Mercy Medical System sought a data lake for a single view of its patients – "one patient, one record". The existing platform impeded the goal of enriching Epic data for 1 million patients across 35 hospitals and 500 clinics. Moving Epic EMR data to the Clarity EDW took 24 hours and was "never going to enable real-time analytics"; with HDP it now takes 3-5 minutes. Improved billing processes resulted in $1M additional annual revenue from newly documented secondary diagnoses and care.
  • 86. Mercy's Journey: Better Health Through Data. Journey steps (renovate to innovate): Epic EMR replication, OPEX efficiency, Epic enrichment, device data ingest, lab notes archive, privacy database, billing, vital sign monitoring, single patient record, medical decision support, preventive care, better health. Results: searches of free-text lab notes speed researcher insight from "never" to "seconds"; ingest of ICU vital signs increased by 900X, letting clinicians respond more quickly; Mercy is building real-time tools to support surgical decisions and preventive care.
  • 87. Webtrends: The Data Journey Towards Personalized Online Ads
  • 88. Webtrends' Journey. Massive volumes of weblogs fueled Webtrends' growth – and also its skyrocketing storage costs. Webtrends provides digital marketing solutions for more than 2,000 companies in 60 countries, processing 13 billion daily online events. Data used to be processed in relational databases and stored on large NAS appliances, which were not economical at scale. Processing occurred on-premises, without cloud-based capabilities. Diseconomies of scale hampered the company's objective of helping its customers predict optimal online ad placement.
  • 89. Webtrends' Journey: Petabytes of Weblogs Analyzed with Spark at Scale. Journey steps (renovate to innovate): SQL Server offload, web log analysis, per-customer click path, behavioral segmentation, LCV analysis, ad click predictions, personalized online ads. Results: data streams in from a vast array of desktop and mobile devices; 13 billion daily events are collected, in milliseconds per event; no data cleansing is necessary prior to analysis with Apache Spark; 2 clusters were consolidated into 1 YARN-based HDP cluster; launched a new product, Webtrends Explore™, powered by HDP. "We're able to…look at this data set and process it and do predictions, behavioral analysis. We can do things that allow us to determine ROI for different actions and behavioral patterns." – Peter Crossley, Chief Architect
  • 90. Watch the Webtrends videos: https://youtu.be/hwpGj57VGz0 and https://www.youtube.com/watch?v=LifVwIwN61E
  • 91. The Data Journey for Cyber Security
  • 92. Symantec's Journey. Analyzing streaming threat data to increase velocity for time to protection. The Symantec™ Global Intelligence Network includes more than 57 million attack sensors across 157 countries. Data streams in from 75 million users on 120 million devices. Legacy platforms created 3-4 hour processing latencies when analyzing log files for digital threats, and attackers could exploit those processing time windows.
  • 93. Symantec's Journey: Data Science Speeds Time to Protection. Journey steps (renovate to innovate): Greenplum offload, security log analysis, device data ingest, metadata capture, threat archive, threat detection, attacker detection, threat predictions, unified security, digital security. Results: threat detection latency reduced from 4 hours to 2 seconds; time to protection improved 5000x; machine learning over tens of petabytes of historical data predicts threats to customers; the cloud team uses Ambari and Cloudbreak for dynamic clusters to meet peak workloads.
  • 94. The Data Journey to Secure Telco Networks
  • 95. Neustar's Journey. Neustar's telco network analytics business was limited by high data storage costs. Neustar offers its telecommunications customers network analytics services, but in 2011 faced a cost of $100,000 per terabyte of storage. It could only economically capture 10% of the data flowing through its networks, retained for 60 days. Neustar's CEO challenged her data warehousing group to retain 100% of the network data for at least one year.
  • 96. Neustar's Journey: Architecture Renovation Funded Service Innovation. Journey steps (renovate to innovate): network data storage, enriched app data, single view of the network, DDoS attack mitigation, rapid threat response, proactive network protection, new info services, secure telecom networks. Results: cost per terabyte reduced from $100K to under $250; 100% of data now retained, growing storage capacity 150X; data retention extended from 60 days to 2 years; elimination of existing support fees saved millions annually; new data assets help Neustar grow its product portfolio.
  • 97. The Data Journey to a Balanced Supply Chain
  • 98. Cardinal Health's Journey. Constrained analysis of the medical supply chain at Fuse by Cardinal Health. Cardinal Health supplies equipment and medicines to 85% of US hospitals and clinics. Limited visibility into the entire supply chain prevented suppliers from understanding how their drugs were prescribed, and acute pharmacists couldn't see all the product options they could prescribe for various conditions.
  • 99. Cardinal Health's Journey: Cardinal Health Launched a New Line of Business. Journey steps (renovate to innovate): sensor data ingest, public data ingest, prescription archive, drug supply chain analytics, drug cost optimization, single patient record, clinical decision support, pandemic response, outcome-based medicine, balanced medical supply chain. Results: Fuse by Cardinal Health aims to make healthcare safer and more cost-effective; the team enriches supply chain data with public sources, bringing suppliers, providers and patients closer together; data processing speeds doubled; Fuse shows suppliers how their drugs are used.
  • 100. Anonymous Case Studies
  • 101. AD1 – Advertising (manages online media programs for retail e-commerce websites). Creating opportunity. Data: clickstream & server log. Online ad placement analytics for a mega-retailer. Problem: the digital ad firm was unable to connect impression and click data. One of the world's largest retail websites made guesses about online ad placement based on Google Analytics. Clickstream data flowed in at hundreds of MB per hour and billions of rows per month, straining the existing architecture. There was no ability to connect ad impressions to clicks to purchases, nor to detect browsing device, geo-location, or whether the customer was in the store. Solution: a unified web tracking data repository provides a 360-degree view of online behavior. Impression files and click files are stored in the same data lake and easily joined for customer insight. With better targeting, fewer ads can be placed, improving the overall customer web experience. Social media data will be added for brand sentiment analysis.
  • 102. BK1 – Banking (one of the largest US banks). Creating opportunity. Data: structured, clickstream, social & unstructured. Monetize anonymous and aggregate banking data. Problem: valuable banking data needed to be anonymized and unified. The bank possesses data that indicates larger macro-economic trends, which can be monetized in secondary markets, but regulations and company policies protect customer privacy, data sets are isolated in legacy silos controlled by LOBs, and IT was challenged to join data while guaranteeing anonymity. Solution: a cross-bank data lake for aggregate data with secure access. Multiple data sets are abstracted from source platforms, with a single point of security and privacy for de-identification, masking, encryption, authentication and access control. Mortgage bankers, consumer bankers, the credit card group and treasury bankers have access to the same cross-sell data. Interoperability with partners SAS, R, Red Hat and Splunk; economies of scale for compressing and archiving data; significant reduction in storage costs from prior platforms.
  • 103. BM1 – Building Management (building efficiency and power solutions; >$420B in revenue; >140 employees). Improving efficiency. Data: sensor. Sensor data monitors buildings for efficiency. Problem: managing service calls on HVAC in commercial buildings. More than 70K systems in buildings around the US transmit data, but it is mostly kept on site or discarded. Servicing costs are high due to limited data on each unit, and data on work orders, sales orders and service orders is stored in different databases and not correlated. Solution: data consolidation and predictive analytics for efficiency. Raw data from HVAC sensors will land in HDP, along with work order, sales order and service call data. The system will predict component failures for product upsell (increased revenue) and service call efficiency (reduced costs), with management insight for a new service offering.
  • 104. EN1 – Energy (one of the world's largest producers of electricity; >$100B in revenue; >39 million customers; >150K employees). Improving efficiency. Data: sensor. Sensor data from smart electricity meters. Problem: the utility needs to match electricity supply with demand. Utilities cannot store power – it needs to be used. Some energy load is predictable, some is not. Overproduction requires cutting back and running below capacity, while underproduction risks starting less efficient "peaker plants". Smart meter data allows real-time analysis that can help effectively match energy production with consumption. Solution: predict demand spikes by analyzing real-time sensor data. Hive + Storm on YARN stream data into Hadoop; R + Mahout analyze aggregate consumption trends for predictive algorithms. More effective matching of energy production and consumption reduces energy costs and emissions.
  • 105. EN2 – Energy (major provider of upstream oil field services; >$29B in revenue; operations in 80 countries; >75K employees). Improving efficiency. Data: structured, sensor & server log. Proactive oil field decisions for pump equipment utilization. Problem: limited visibility into utilization of pump equipment. Oil field services cover exploration, drilling, well construction and production optimization, and the company manages a huge base of costly field equipment across 80 countries. Collecting and analyzing pump equipment data required time-consuming manual effort, and the standard data warehouse model and traditional reports did not scale well, yielding incomplete results. Solution: combine structured data, sensor data and log data for proactive equipment decisions. This reduces the manual time and effort to collect and analyze data from sensors above and below ground, as well as log data from pump trucks. The Big Data project runs in the Accenture Cloud, with Accenture providing data architecture, data science and project management services, integrated with embedded technologies from Hortonworks technology partners Microsoft, SAP and HP. Project goal: reduce equipment expense and improve margins.
Page 106 / 156
Powering Music Recommendations
Problem
CDH cluster failed, causing downtime
● Highly technical team was running a CDH cluster, without support
● CDH failed; CTO asked team to research support options
● Hive table stores data on all music streamed by users
● Data in Hive is mission-critical: used to recommend music & to pull monthly reports used to pay each music label
● Data expertise is their only sustainable competitive advantage
Solution
HDP powers music recommendation engine
● Stable recommendation engine and reconciliation reports
● Proactive technology partnership with their engineers, who are consumers of & contributors to Hadoop
● 2X per year, Hortonworks reviews the cluster for optimization
● Data was migrated from CDH to HDP, quickly and easily
Creating Opportunity
Data: Clickstream & Server Log
Entertainment
Online music streaming
>$500M in revenue
>24M users
ET1
Page 107 / 156
Donor and Voter Analytics for a Political Organization
Problem
Limited insight into donor behavior & voter mobilization
● Fundraising phone services lacked analysis on why donors give
● For campaign management, needed analysis on what factors cause constituents to register and vote
● Client knew they needed Hadoop for storage and analysis
● Needed education on roadmap, use cases and execution
Solution
Donor data store improves revenue from tele-fundraising
● Speed: rapid delivery of donor data store
● Deployment flexibility: runs in a Windows environment
● Targeted: phone reps talk to donors about the issues important to them
● Discovery: explore and enrich data from campaign operations
Creating Opportunity
Data: Unstructured
Fundraising
Political organization dedicated to tele-fundraising, voter contact and media services
>$1M in revenue
~100 employees
FR1
Page 108 / 156
Analysis of Gamer Data for Future Innovation
Problem
Social gaming platform needs more storage, more stability
● 4 million monthly gamers generate customer interaction data
● Existing CDH cluster was going down every month
● Desired tight integration with Datameer analytics tools
● Needed interactive query; Impala was not meeting that need
● Rapidly growing user base; need to manage the cluster as it scales
Solution
HDP for stability at scale, tight integration with Datameer
● Stable cluster that doesn't fall down like CDH did
● Easy data extracts from SQL Server
● Datameer analytics tools certified on HDP
● High-performing Hive queries
● Ambari for provisioning and maintenance as the cluster scales up
Creating Opportunity
Data: ETL
Gaming
Online strategy & role-playing games
~4M users
~$325M in revenue
~500 employees
GM1
Page 109 / 156
Gamer Migrates a Homegrown Cluster to HDP
Problem
Social gaming platform used Hadoop, but needed support
● Social gaming platform built its own Hadoop cluster
● Heavy users of Hive for analysis of player behavior
● Hadoop analysis informed strategies to prolong length of play, encourage purchase of virtual goods and respond to timed in-game events
● Heavy processing needs and ~1 petabyte of data outpaced the company's ability to support and extend its in-house cluster
Solution
HDP functionality + Hortonworks support = better games
● Easy migration from the native Hadoop cluster preserved data and processing tools
● HDP cluster includes a more complete ecosystem: Ambari, Flume, HBase, Hive, Oozie, Pig, Sqoop, Storm, ZooKeeper
● Social media sentiment analysis combined with data on player stats and behavior, used to improve games and their revenue
Creating Opportunity
Data: Clickstream, Server Log, Social & ETL
Gaming
Social gaming
~5M users
>$100M in revenue
~500 employees
GM2
Page 110 / 156
Clearing the Federal ETL Consulting Backlog
Problem
Federal consulting practice faces ETL backlog
● Sequestration budget cuts created demand to offload ETL from SAS
● Consulting practice faces a backlog of millions of dollars consulting on offload from SAS at 20 federal civilian agencies
● After offload, all data must still be easily accessible
Solution
Rationalized data storage saves taxpayer money
● Federal civilian agencies reduce ongoing data storage cost
● No loss of data or disruption to operations
● Base SAS and SAS/ACCESS are two out-of-the-box solutions for connectivity between SAS and Hadoop, via Hive
Improving Efficiency
Data: ETL
Government
Professional service provider consulting on federal projects
>$13B in revenue
>50K employees
GV1
Page 111 / 156
Processing Time-Sensitive Employment Reporting with Confidence
Problem
Agency reporting on labor data has 9 working days to prepare its report
● Agency reports on inflation, pay and benefits, unemployment levels and labor productivity
● Agency's monthly employment report moves financial markets
● State agencies report unemployment data to the federal office by the first Friday of the month
● Total data set is hundreds of millions of rows in 30 comma-separated files
● If the team finds errors in state data, it may take days to correct with the state affiliate
● Final report must be published by the third Friday of the month; time is precious
Solution
HDP speeds processing and improves confidence in unemployment findings
● Hortonworks partner OpenOsmium introduced Hortonworks to the client team
● Federal budget pressures created favorable policies towards open source software
● POC pilot: processing one of the thirty files on an HDP/Amazon Cloud solution
● Processing time reduced from 18 hours to less than 1 hour
● Absolutely no disruption to existing systems or operations
● Cloud cluster runs on an "as needed" basis, shut down remotely when not needed
Improving Efficiency
Data: Structured
Government
US federal government labor agency
GV2
Page 112 / 156
Sentiment Analysis for Government Programs
Problem
Ministry of Education felt removed from public sentiment on its programs
● In-person events lacked reach and persistence
● Ministry of Education wanted to understand sentiment from the citizenry on specific issues such as childhood obesity
● Two dedicated analysts pored over the social media stream and provided daily reports to a member of parliament
● IT team sought improvement over the limitations of manual analysis
Solution
Powerful "same day" sentiment analysis helps outreach (a scoring sketch follows this slide)
● Team produces daily memos on public sentiment, now with:
– Reach: includes opinions from a broader base of citizenry
– Confidence: more data, more confidence in opinion analysis
– Frequency: daily reads show policy-makers changes over time
– Precision: allows micro-analysis of specific issues and geographies
● Solution aligns to the government's support for open source
● Individual social media authors receive invitations to in-person meetings with government ministers
Creating Opportunity
Data: Social
Government
European national government
GV3
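The deck does not say which sentiment technique the ministry used. As a deliberately tiny, hypothetical illustration of lexicon-based scoring over a social stream (the word lists and messages are invented; production systems use much larger lexicons or trained models), consider:

```python
# Tiny sentiment lexicon -- purely illustrative word lists.
POSITIVE = {"good", "great", "love", "support", "helpful"}
NEGATIVE = {"bad", "angry", "fail", "against", "unhealthy"}

def score(message: str) -> int:
    """Positive minus negative word count; >0 leans positive, <0 negative."""
    words = message.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

stream = [
    "Love the new school lunch program, very helpful",
    "Another fail, kids are still eating unhealthy food",
]
for m in stream:
    print(score(m), m)
print("Net sentiment for the day:", sum(score(m) for m in stream))
```

Run daily over the full stream, even a crude score like this yields the trend lines behind the "daily reads show changes over time" bullet.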
Page 113 / 156
Sensor Data for Healthcare Supply Chain
Problem
Medical products have limited shelf life; tracking is essential
● Medical products delivered to pharmacies and hospitals
● Epidemics require agile changes to delivery schedules
● Materials are time-sensitive and climate-controlled
● Delivery logistics are complex & subject to risks outside of the company's control (product availability, weather, traffic, etc.)
● Slow delivery can harm supplies and medical outcomes
Solution
Sensor data protects supply chain, improves efficiency
● Sensor data from individual items and vehicles will give the company unprecedented supply chain visibility
● Analytic platform enables predictive algorithms for infrastructure planning, disease forecasting and supply chain forecasts
● Better tracking reduces waste, improves customer confidence and patient health
Improving Efficiency
Data: Sensor
Healthcare
Supplier of pharmaceuticals & medical products to pharmacies & hospitals
>$100B in revenue
>30K employees
HC1
Page 114 / 156
Predictive Analytics & Real-time Monitoring of Vital Signs
Problem
Unable to store sufficient data for decision support
● 22 years of data for 1.2 million patients, ~9 million records
● Data on the legacy system was neither searchable nor retrievable
● Cohort selection for research projects was slow
● For decision support, clinicians had minimal access to historical data gathered across all patients
Solution
Unified repository provides data to both researchers & clinicians
● "View only" legacy system retired, saving $500K
● 9 million historical records now searchable & retrievable
● Records stored with patient identification for clinical use; the same data presented anonymously to researchers for cohort selection
● Real-time monitoring: patches record vital signs every minute, algorithms notify clinicians if numbers cross risk thresholds (see the sketch after this slide)
● Readmit reduction: heart patients weigh themselves daily, algorithms notify doctors about unsafe weight changes
Improving Efficiency
Data: Sensor, Social & ETL
Healthcare
Public university teaching hospital
Consistently rated by US News & World Report as among America's best hospitals
>17K patient admissions
>400 physicians
~12K surgeries ('12)
HC2
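The slide describes threshold-based alerting on minute-by-minute vitals without showing the rule logic. A minimal sketch, assuming invented thresholds and field names (real clinical limits would be set by the care team), might look like:

```python
# Illustrative risk thresholds -- NOT clinical guidance; real limits are per patient.
THRESHOLDS = {
    "heart_rate":  (40, 130),   # beats per minute (low, high)
    "spo2":        (92, 100),   # blood oxygen saturation, percent
    "systolic_bp": (90, 180),   # mmHg
}

def check_vitals(patient_id: str, vitals: dict) -> list:
    """Return alert messages for any vital sign outside its allowed band."""
    alerts = []
    for name, value in vitals.items():
        low, high = THRESHOLDS[name]
        if not low <= value <= high:
            alerts.append(f"ALERT {patient_id}: {name}={value} outside [{low}, {high}]")
    return alerts

# One minute's reading from a (hypothetical) wearable patch.
for msg in check_vitals("P-0042", {"heart_rate": 144, "spo2": 95, "systolic_bp": 118}):
    print(msg)  # would be routed to the on-call clinician in a real pipeline
```

The daily-weight rule for heart patients works the same way, just with a delta check against the previous reading instead of a fixed band.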
Page 115 / 156
Affordable, Scalable Data for Healthcare Analytics
Problem
Relational database architecture limited data exploration
● Develops and maintains analytic applications for doctors
● Company couldn't access the volume or variety of data it wanted for those applications
● Analyzing huge data sets on relational databases was too slow
Solution
HDP reveals new big data insights, with cost savings & flexibility at scale
● Link and access new types of data that are currently outside of the healthcare domain, such as pharmacy receipts, text messages or patient web searches
● Per-node TCO of data on HDP was 25% that of the current relational DB
● Open-source Hadoop ecosystem gives multiple hardware and software integration options as the company scales its architecture
Creating Opportunity
Data: ETL
Healthcare
Analytics tools and decision support for the healthcare industry
~$130M in revenue
>2K employees
HC3
Page 116 / 156
Rapid Detection & Intervention for Stroke Prevention
Problem
Conditions that mimic strokes consume the short windows for critical intervention
● Some conditions show stroke-like symptoms (e.g. migraines or muscle spasms)
● Stroke neurologists spend 50% of their time with non-stroke patients
● Transient ischemic attacks ("TIAs") are mini-strokes that present like migraines, but are highly predictive of future full-blown strokes within the following days
● Incomplete or slow access to patient data hampers clinicians' ability to respond promptly to TIAs
Solution
HDP unifies present-day images with historical data to quickly identify TIAs
● Patient contact records (calls to the province's health hotline) merged with population historical records and present-day medical images improve diagnosis
● Algorithms on population risk factors (weight, age, cardiovascular problems) are mined for the probability that a given patient has similar risk factors
● With quantified risk factors, doctors quickly identify those at risk of imminent stroke
● Prescriptions of blood thinners, exercise and diet reduce the incidence of those strokes
Improving Efficiency
Data: Sensor, Unstructured & Structured
Healthcare
Top Canadian research university, researching epilepsy, stroke care and brain surgery outcomes in a government-run healthcare system
HC4
Page 117 / 156
Management of Chronic Health Conditions Such as Epilepsy
Problem
Epilepsy is a chronic, unpredictable & difficult-to-treat condition
● Epilepsy can go undiagnosed while seizures are minor
● Epileptics are at higher risk of depression, making the condition more difficult to manage
● Tabular data is gathered through treatment at epilepsy specialty clinics
● Additional tabular data in the system is difficult to combine for a complete picture
● Social data on patient behavior is unavailable for combination with tabular data on clinical history and pharmaceutical prescriptions
Solution
HDP healthcare data lake joins disparate data, for better disease management
● Data lake for a 360-degree view of the patient: electronic medical records, history of clinic visits, Facebook, Twitter & sentiment survey data
● Regular patient self-reporting with targeted surveys via mobile and web applications
● Dynamic calculation of changing sentiment scores, useful for proactive outreach
● Clinicians will have the ability to reference current psychographic & sentiment data immediately before (and during) the patient's scheduled clinical visits
Improving Efficiency
Data: Social & Structured
Healthcare
Top Canadian research university, researching epilepsy, stroke care and brain surgery outcomes in a government-run healthcare system
HC5
Page 118 / 156
Robotics & Real-Time Decision Support in Brain Surgery
Problem
Brain surgeons make real-time decisions using only a fraction of available data
● Brain surgeons may spend hours working (non-destructively) through brain tissue
● For aneurysms, they must clamp a weak point in a specific blood vessel
● Surgical assistant presents ~100 clamps, which the surgeon tries until finding a good fit
● Clamps exposed to the surgical environment are discarded, at a cost of $100K per surgery
● Time spent selecting and testing clamps can negatively affect surgical outcomes
Solution
Robots, streaming video & surgery inside an MRI with real-time decision support
● Researchers developed non-magnetic robots that surgeons control within an MRI
● Constant streaming of MRI imaging helps decisions while surgery is underway
● Recordings of MRI data stored in Hadoop, analyzed with machine-learning algorithms
● MRI images compared to surgical outcomes for insight into best practices
Improving Efficiency
Data: Sensor, Unstructured & Structured
Healthcare
Top Canadian research university, researching epilepsy, stroke care and brain surgery outcomes in a government-run healthcare system
HC6
Page 119 / 156
Data Science on Text-based Health Claim Records
Problem
Claims data in PDFs, hard to identify coding errors
● Produces applications for medical decision support
● Goal is marrying electronic health records with claims data
● 300K daily connections with individuals around unstructured data in PDFs (claims records and patient-reported outcomes)
● Data analysis is disjointed; difficult to identify patients and events that have been mis-coded or incompletely coded
Solution
Datasets unified in Hadoop to improve health outcomes
● Optical character recognition & natural language processing (a simple flagging sketch follows this slide)
● All of the unstructured, text-based data stored on HDP
● Coding errors will be identified much more efficiently
● Partially coded records can also be identified
● Coding efficiency will improve revenue
● Analysis of the underlying data will improve health outcomes
Improving Efficiency
Data: Unstructured
Insurance – Health
Large US medical insurer
>$100B in revenue
>100K employees
IH1
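The slide names OCR and NLP but not the downstream check. As a deliberately simplified, hypothetical sketch of flagging incompletely coded claims, the snippet below scans OCR'd claim text for ICD-10-style diagnosis codes with a regex and flags records whose billed procedures lack a supporting diagnosis (the pattern, fields and sample claims are all assumptions):

```python
import re

# ICD-10-CM codes look like "E11.9" or "I63.9": a letter (U is reserved),
# two digits, optionally a dot and more characters. Simplified for illustration.
ICD10 = re.compile(r"\b[A-TV-Z][0-9]{2}(?:\.[0-9A-Z]{1,4})?\b")

def flag_claim(claim_id, ocr_text, billed_procedures):
    """Flag claims where billed work is not backed by any diagnosis code."""
    diagnoses = ICD10.findall(ocr_text)
    if billed_procedures > 0 and not diagnoses:
        return f"{claim_id}: {billed_procedures} procedures billed, no diagnosis code found"
    return None

claims = [
    ("CL-1", "Patient presents with type 2 diabetes E11.9, metformin continued.", 2),
    ("CL-2", "Follow-up visit, patient doing well.", 1),  # incompletely coded
]
for claim_id, text, procs in claims:
    issue = flag_claim(claim_id, text, procs)
    if issue:
        print("REVIEW:", issue)
```

The real pipeline would apply richer NLP than a regex, but the pattern-then-flag structure is the same: extract candidate codes, then route codeless or inconsistent records for human review.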
Page 120 / 156
Insurance Data Lake to Manage Risk
Problem
Challenges merging new & old data hamper analysis
● Traditional and newer types of data were both growing quickly but were difficult to combine in the EDW
● "Schema on load" requirements of the EDW platform limited ingest of some data with significant predictive power
● Company missed data-driven ways to serve customers
● Process of separating legitimate from fraudulent claims created a "needle-in-a-haystack" problem
Solution
Common platform for all types of data improves up-sell and reduces fraud
● "Schema on read" Hadoop architecture means that more data sources can be easily ingested to enrich predictive analytics (see the sketch after this slide)
● Agents use big data insights to determine the best action for valued customers and recommend those in real-time
● Claims analysts and underwriters process streaming data to quickly flag fraud risks and fast-track legitimate claims
Creating Opportunity
Data: Structured, Clickstream, Server Log
Insurance – Health
Large US medical insurer
>$30B in revenue
>20M members
~35K employees
IH2
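"Schema on load" versus "schema on read" is the technical crux of this slide: the warehouse demands a fixed schema before data lands, while Hadoop stores raw files and applies structure at query time. A minimal pure-Python sketch of the schema-on-read idea (the log format and field names are invented):

```python
import json

# Raw lines land as-is -- no schema was required at ingest time.
raw_lines = [
    '{"member": "M1", "event": "claim", "amount": 1200}',
    '{"member": "M2", "event": "login", "device": "mobile"}',  # different shape, still stored
]

def read_with_schema(lines, wanted_fields):
    """Apply structure at read time: project whatever fields this analysis needs."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in wanted_fields}  # missing fields become None

# Two different "schemas" over the same raw data, chosen per analysis.
claims_view = list(read_with_schema(raw_lines, ["member", "amount"]))
device_view = list(read_with_schema(raw_lines, ["member", "device"]))
print(claims_view)
print(device_view)
```

Because no shape is enforced at load time, a new data source with "significant predictive power" can be ingested immediately and modeled later, which is exactly the flexibility the EDW lacked.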
Page 121 / 156
Speeding Analysis for Usage-Based Car Insurance
Problem
Risk analysis lagged because of architecture gaps
● Business insight from data analysis was too slow
● Growing volume, velocity and variety of incoming data taxed existing systems & processes
● ETL process across disparate systems only captured 25% of the dataset and took 5-7 days to complete
Solution
Speed time-to-insight with clickstream analytics & faster ETL
● Clickstream analytics
– Moving from a hosted Azure platform to HDP on site will improve performance and analytical functions (with Apache Hive)
● ETL acceleration
– Process 100% of the data, in three days or less
Creating Opportunity
Data: Clickstream & ETL
Insurance – Property & Casualty
Personal auto & other property-casualty insurance
>$17B in revenue
~28K employees
IP1
Page 122 / 156
Data Lake for P&C Insurance Claim Analysis
Problem
Structured data scaled; unstructured data analysis did not
● Large P&C insurance provider had systems for analyzing structured data at scale
● Unstructured data from claims notes and social media had the potential to add valuable information to claims analysis
● Structured data analysis scaled, but joining that information with hand-written notes or social media data did not scale
● Limited data visibility hampered underwriting and claims
Solution
Merge structured & unstructured data for better decisions
● "Schema on read" Hadoop architecture means that more data sources can be easily ingested (text and social media)
● Previously disparate data sets are joined for greater insight
● Larger data sets fed to front-end business tools provided by Hortonworks partners: SAS, Tableau and QlikView
Improving Efficiency
Data: Structured, Social & Unstructured
Insurance – Property & Casualty
Major provider of property casualty, life and mortgage insurance
>$65B in revenue
>60K employees
Operations in >100 countries
IP2
Page 123 / 156
Maintaining SLAs for Equity Trading Information
Problem
Meeting 12-millisecond SLAs for the "ticker plant"
● Daily ingest: 50GB of server log data from 10,000 feeds
● Four times daily, this data is pushed into DB2
● Applications query this data 35K times per second
● 70% of queries are for data <1 year old, 30% for data >1 year old
● Current architecture can only hold 10 years of trading data
● Growing volume puts performance at risk of missing SLAs
Solution
Meeting SLAs with confidence
● HBase provides super-fast queries within SLA targets (see the row-key sketch after this slide)
● ETL offloading to Hadoop allows longer data retention, without jeopardizing fast response times
Improving Efficiency
Data: Server Log & ETL
Investment Services
Highly trafficked website providing business and financial information
~15K employees
IS1
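Millisecond HBase reads generally come down to row-key design: keys sorted as symbol + timestamp make "recent ticks for one symbol" a short contiguous scan rather than a table-wide search. A hypothetical sketch using the happybase Python client; the host, table name, column family and key layout are all assumptions, not details from the case study:

```python
import happybase  # Thrift-based Python client for HBase

conn = happybase.Connection("hbase-thrift-host")  # hypothetical host
ticks = conn.table("ticks")  # assumed table with column family "q"

def row_key(symbol: str, ts_millis: int) -> bytes:
    """Keys sort by symbol, then time, so one symbol's ticks are contiguous."""
    return f"{symbol}#{ts_millis:013d}".encode()

# Write one tick (quote fields are illustrative).
ticks.put(row_key("ACME", 1700000000123), {b"q:price": b"101.25", b"q:size": b"300"})

# Point reads and short range scans over a contiguous key range stay fast at
# high query rates -- this is the property a 12 ms SLA relies on.
start, stop = row_key("ACME", 1700000000000), row_key("ACME", 1700000060000)
for key, data in ticks.scan(row_start=start, row_stop=stop):
    print(key, data[b"q:price"])
```

The zero-padded timestamp keeps lexicographic order equal to chronological order, which is what makes the range scan work.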
Page 124 / 156
Banking Data Lake for 100s of Use Cases
Problem
Architecture unsuited to capitalize on server log data
● Huge investments company generates valuable data assets which are largely unavailable across the organization
● Current EDW solutions are appropriate for some data workloads but too expensive for others
● Financial log data is difficult to aggregate & analyze at scale
● Short retention hampers price history & performance analysis
● Limited visibility into the cost of acquiring customers
Solution
Multi-tenant Hadoop cluster to merge data across groups
● Server log data will be merged with structured data to uncover trends across assets, traders and customers
● ETL offload will save money for Hadoop-appropriate workloads
● Longer data retention enables price history analysis
● Joining data sets for insight into customer acquisition costs
● Accumulo enforces read permissions on individual data cells
Creating Opportunity
Data: Server Log
Investment Services
Global investments company
>$1.5 trillion assets under management
>$14B in revenue
~50K employees
IS2
Page 125 / 156
Anti-Laundering & Trade Surveillance for Investment Firm
Problem
Lags in back office system limit intraday risk analysis
● 15M transactions and 300K trades every day
● Storage limitations required archiving, limiting data availability
● Trading data not available for risk analysis until end of day, which hampers intraday risk analytics and creates a time window of unacceptable exposure
Solution
Data lake accelerates time-to-analytics & extends retention
● Shared data repository combines more comprehensive data sets about all firm activities, improving data transparency
● Operational data available to risk analysts earlier, same day
● Trading risk group will process more position, execution and balance data and hold that data for five years
● Hadoop enables ingest of data from recent acquisitions despite disparate data definitions and infrastructures
Creating Opportunity
Data: Structured
Investment Services
Trading services for millions of client accounts
>$16B in assets
>4,000 advisors
IS3
Page 126 / 156
Customer Insight from Consumer Electronics Product Usage Data
Problem
Lacked central repository for efficient data storage & analysis
● Rivers of data flow from millions of consumer electronic products
● Company lacked a platform to capture new types of data: geolocation, clickstream, server log, sensor & unstructured
● Unable to exploit key competitive advantage: unique customer insight from troves of big data
Solution
Efficient data storage unlocks value in company data
● Hadoop data lake permits a view into how customers use products, across multiple types of data
● Lower cost of storage improves the margin for retaining data
● Powerful cluster includes many key ecosystem projects: Hive, HBase, HCatalog, Pig, Flume, Sqoop, Ambari, Oozie, Knox, Falcon, Tez and YARN
Creating Opportunity
Data: Geolocation, Clickstream, Server Log, Sensor & Unstructured
Manufacturing
Consumer electronics
>$180B in revenue
>400K employees
MF1
Page 127 / 156
Optimizing High-Tech Manufacturing
Problem
Data scarcity for root cause analysis on product defects
● 200 million digital storage devices manufactured yearly
● Devices not passing QA are scrapped at the end of the line
● >10K faulty devices returned by customers every month
● Limited data available for root cause analysis means that diagnosing problems is highly manual (physical inspections)
● Subset of sensor data from QA testing retained 3-12 months
Solution
Data retention doubled, with 10x processing improvement
● Repository of sensor data now holds a larger portion of total data
● Dashboard created 10x more quickly than before Hadoop
● Data retained for at least 24 months
● Manufacturing dashboard allows >1,000 employees to search data, with results returned in less than 1 second
Improving Efficiency
Data: Sensor
Manufacturing
Digital Storage Devices
>$15B in revenue
>85K employees
MF2
Page 128 / 156
Social Site Speeds Processing, Reduces Cost
Problem
Data growth outpaced existing Greenplum solution
● 20M monthly unique visitors, and growing
● Greenplum storage solution was slow and expensive
● Operations team challenged by data growth
● Analytics team hampered by slow processing speed
Solution
Processing speeds doubled, storage cost decreased
● Operations team saw processing speeds 2x those of Greenplum
● Significant cost savings from moving data to HDP
● During this second year of the support relationship, plans to move more workloads to HDP, for better insights at a lower cost
Creating Opportunity
Data: Clickstream & Server Log
Online Community
Online social network
>$50M in revenue
>300M members
2nd year with Hortonworks
OC1
Page 129 / 156
Powering Professional Network Recommendations
Problem
Lack of a recommendation engine to promote connections
● >13M non-English-speaking members find jobs & connections
● User interactions generate semi-structured data
● Clickstream, server log and social data could feed recommendations
● Company lacked a stable platform to store, refine & enrich that raw data
Solution
Hadoop recommendation engine to compete with LinkedIn
● Replaced existing CDH cluster
● New types of data feed a superior recommendation engine that enhances the value of belonging to the community
● YARN, Tez and the Stinger initiative provide near-term functionality and long-term confidence
Creating Opportunity
Data: Clickstream, Server Log & Social
Online Community
Online professional network
>$90M in revenue
>13M members
OC2
Page 130 / 156
Better Romantic Matches with Data Science
Problem
Newer types of data unavailable for matchmaking algorithms
● Unable to store clickstream data and user-entered content
● Other types of data only retained for seven days
● Recommendations would help users craft attractive profiles
● High costs to store an ever-growing amount of data
● Relational data platform did not fulfill their requirements
Solution
Hadoop cluster for A/B testing, device analysis, text mining
● A/B testing: consolidate email & clickstream from SQL databases (see the sketch after this slide)
● Usage patterns across devices, browsers and applications; understand who uses their mobile app
● Mine user-created text (profile language and user-to-user communications) for the recommendation engine
● Longer data retention: find subtle trends with a longer time window
Creating Opportunity
Data: Server Log & ETL
Online Community
Online dating site
>300 employees
OC3
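The deck mentions A/B testing without detail. As a minimal, self-contained illustration of the statistics behind it (the conversion counts are invented), the snippet below runs a two-proportion z-test; a |z| above roughly 1.96 suggests the difference between variants is unlikely to be chance at the 5% level:

```python
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Invented example: variant A (new email subject line) vs. B (control).
z = two_proportion_z(conv_a=620, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z = {z:.2f}")
if abs(z) > 1.96:  # two-sided test at the 5% significance level
    print("Variants differ significantly; ship the winner.")
```

Consolidating email and clickstream data in one place is what makes tests like this routine: both the exposure (who saw which variant) and the outcome (who converted) live in the same store.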
Page 131 / 156
360° View of Customer for Call Center Sales
Problem
Call center sales reps unable to recommend the best product
● 2000+ product lines
● Multiple customer interaction channels (web, Salesforce, face-to-face, phone)
● Poor visibility causes sales reps to miss opportunities, and customer satisfaction suffers
Solution
Improve sales conversions with optimal product recommendations
● Call center reps will understand every interaction with the customer, to improve service calls
● Natural language analysis of rep emails to customers identifies the best response language and coaching opportunities
● Recommendation engine predicts the next best product for each customer
Creating Opportunity
Data: Unstructured
Retail
IT solution and equipment reseller
>$10B in revenue
>6K employees
RT1
Page 132 / 156
360° Customer View for Home Supply Retailer
Problem
Lack of a unified customer record across all channels
● Global distribution online, in home and across 2000+ stores
● Unable to create a "golden record" for analytics on customer buying behavior across all channels
● Data repositories on website traffic, POS transactions and in-home services existed in isolation of each other
● Limited ability for targeted marketing to specific segments
● Data storage costs increasing
Solution
HDP delivers targeted marketing & data storage savings
● Golden record enables targeted marketing capabilities: customized coupons, promotions and emails
● Data warehouse offload saved millions in recurring expense
● Customer team continues to find unexpected, unplanned uses for its 360-degree view of customer buying behavior
Creating Opportunity
Data: Clickstream, Unstructured, Structured
Retail
Major home improvement retailer
>$74B in revenue
>300K employees
>2,200 stores
RT2
Page 133 / 156
Using In-Store Location Data to Improve Cross-Sell
Problem
Retailer lacks data on how customers move through stores
● Placement of product within department stores affects sales
● Sales data is not specific enough to suggest specific changes
● Online retailers can compare what shoppers view with what they buy, but brick-and-mortar stores lack this insight
● Result: critical decisions about store layout and inventory are made without data on shopper movement
Solution
Micro-data on shopper location enables in-store analysis similar to website analysis: locations visited vs. purchases
● Apple iBeacon app captures in-store location data for shoppers who have the app on their iPhones
● Data streams into HDFS on how customers move through the stores, relative to the location of particular products (a dwell-time sketch follows this slide)
● Enables real-time promotions to customers with smart phones, based on who they are and where they stand in the store
● Historical data across all shoppers provides insight on store design
Creating Opportunity
Data: Sensor & Geolocation
Retail
Major omni-channel retailer
>$27B in revenue
>175K employees
>800 stores
RT3
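The slide describes streaming beacon pings but not how they become layout insight. A minimal, hypothetical sketch below turns one shopper's sequence of (timestamp, beacon zone) pings into per-zone dwell times, the raw material for "locations visited vs. purchases" analysis; the data model is invented:

```python
from collections import defaultdict

# (seconds since store entry, beacon zone) pings for one shopper -- invented data.
pings = [
    (0, "entrance"), (30, "electronics"), (60, "electronics"),
    (90, "toys"), (120, "toys"), (150, "checkout"),
]

def dwell_times(pings):
    """Sum time spent in each zone, attributing each interval to its starting zone."""
    totals = defaultdict(int)
    for (t0, zone), (t1, _) in zip(pings, pings[1:]):
        totals[zone] += t1 - t0
    return dict(totals)

print(dwell_times(pings))
# {'entrance': 30, 'electronics': 60, 'toys': 60}
# Joined with purchase data, long dwell + no purchase flags a layout problem.
```

Aggregated across all shoppers and all stores, this is the brick-and-mortar analogue of comparing page views with checkouts on a website.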
Page 134 / 156
Unified Data for Online Recommendation Engine
Problem
5 data sets are fragmented, hampering product recommendations
● 5 major data sets: inventory data, transactional data, user behavior data, customer profiles & log data
● Unified view needed, to recommend items to users
● Currently lacks an analytics dashboard across all types of data
● Storing non-transactional data on the EDW is expensive
Solution
Unified data lake for increased sales and lower costs
● Unified 360° view for recommendations of similar products
● Analytics dashboard joins clickstream with transactional data
● Summary data stored in HBase, can be queried with web apps
● Offload some data from the Teradata EDW, to lower storage costs
● Actively partnering with engineers to improve Hadoop
Creating Opportunity
Data: Structured, Clickstream, Server Log & Unstructured
Retail
eCommerce marketplace
>$12B in revenue
>30K employees
RT4
Page 135 / 156
Predicting Car Prices With High Confidence
Problem
Achieving 99.1% confidence in car price estimates
● Goal: provide consumers & dealers reliable car price guides
● Promise: 99.1% confidence that the projected price paid will be within $20 of the average national price paid in a given week
● As the network of dealers grew, the existing SQL Server data warehouse was expensive and difficult to scale
Solution
Cost savings & data reliability at scale in a data lake
● Mission-critical price data moved to Hadoop architecture
● Server log data flows into HDP with Flume
● Analysis of this data allows analysts to further improve the accuracy of estimates
Creating Opportunity
Data: Server Log & ETL
Retail
Online eCommerce service for buying and selling cars
~300 employees
RT5
Page 136 / 156
Recommendation Engine Improves Department Store Sales
Problem
Need to create better product recommendations
● Multiple touch points: store, kiosk, web and mobile app
● Wants to deliver customized promotions, coupons & recommendations
● Data was not integrated, making a 360° view of customer behaviors impossible
Solution
Recommendations to all channels, based on a data lake
● Ingest all raw data from different product lines into HDP
– Real-time data ingestion
– Structured data ingestion
● Transform raw data
– ETL processing with Pig and Hive
– Use Mahout and R to make recommendations (a co-occurrence sketch follows this slide)
● Recommendations will be fed to all channels
– HBase serves recommendations to the web site, kiosk and mobile app
Improving Efficiency
Data: ETL
Retail
Specialty department store
>$19B in revenue
>130K employees
RT6
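Mahout's classic item-based recommenders rest on co-occurrence: items frequently bought together are good candidates to recommend. A toy pure-Python version of that idea (the baskets are invented; Mahout computes this at cluster scale over the POS data landed in HDP):

```python
from collections import Counter, defaultdict
from itertools import combinations

# Invented purchase baskets standing in for POS transaction data.
baskets = [
    {"dress", "shoes", "belt"},
    {"dress", "shoes"},
    {"shoes", "socks"},
    {"dress", "belt"},
]

# Count how often each pair of items appears in the same basket.
co_counts = defaultdict(Counter)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

def recommend(item, k=2):
    """Top-k items most often co-purchased with `item`."""
    return [other for other, _ in co_counts[item].most_common(k)]

print(recommend("dress"))  # e.g. ['belt', 'shoes']
```

In the architecture on the slide, the batch job producing these pairs runs in Hadoop, and the resulting top-k lists are written to HBase so the web site, kiosk and mobile app can fetch them with low-latency lookups.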
Page 137 / 156
Faster Reports for Real Estate Agents
Problem
Accelerate reports on movers for real estate agents
● 20 million monthly visitors to family of websites
● Reports on movers not consistently generated quickly enough
● Pressure from newer market entrants
● High data storage costs reduce margins on data
Solution
More data for faster reports at a lower cost
● Improved analytical efficiency speeds report turnaround
● Data storage costs lower than before
● Improved visibility into macro trends in real estate
● Refine, explore and enrich the data better than competitors
Improving Efficiency
Data: Clickstream & ETL
Software
Operator of real estate websites
~$200M in revenue
>1,000 employees
SW1
Page 138 / 156
Unified View Across Products, for Product Managers
Problem
Data fragmentation across products and verticals
● More than 20 product lines
● Multiple verticals: retail, financial services, healthcare, manufacturing, communications, utilities & government
● Each product line has a separate data repository
● Unified analysis across product lines was impossible
Solution
Data consolidation for cross-product customer analysis
● Product managers will have unified data for analysis
● Raw data from different products will land in HDP
● Data will then be refined and transformed
● Real-time data ingestion with Flume
● Batch data movement with Sqoop
● ETL processing with Pig and Hive
Creating Opportunity
Data: ETL
Software
Data security software, cloud computing
~$130M in revenue
~1,100 employees
SW2
Page 139 / 156
Data Lake Protects Customers' Enterprise Data Security
Problem
Batch processing created risk exposure; redundant systems drove costs up
● Customer protects the world's largest organizations from data security breaches and backs up their mission-critical data
● Processes client data to identify threats and vulnerabilities
● Multiple acquisitions led to a redundant patchwork of big data analysis solutions, including Greenplum, Netezza and Vertica
● Six LOBs needed a common, multi-tenant data repository
● Existing batch processing caused a 15-minute latency window, with exposure risk
Solution
HDP data lake consolidates infrastructure, reduces cost & speeds response times
● Consolidation into one HDP data lake represents savings of tens of millions of dollars
● Multi-tenancy with YARN permits secure access for multiple LOBs
● Real-time analysis with Apache Storm and interactive query with Apache Hive close the 15-minute risk window from the earlier architecture
● Data lake also used for marketing: clickstream analysis & 360-degree customer view
Improving Efficiency
Data: Server Log, Clickstream & ETL
Software
Global leader in data security, storage and system management software
>$6B in revenue
>18K employees
SW3
Page 140 / 156
Launching New Data Analysis Products
Problem
Enterprise customers have no visibility into performance
● Platforms connect 3.4 billion transactions per year
● Currently storing 90TB, growing at 20% YoY
● All divisions retain data for 36 months, except the healthcare network: 7 years
● Customers have no visibility into their companies' activity on the commerce platforms
● Client wants to add analytics services to cross-sell to existing customers and attract new customers
Solution
HDP data lake enables launch of new information products
● Shorten data processing workloads from days to hours
● Enable ad hoc analytics queries
● Create data analysis products and services for customers of promotion, supply chain and healthcare networks
● New product: anonymous reports that benchmark a customer against competitors in the same industry
Creating Opportunity
Data: ETL
Software
Operator of intelligent ecommerce networks
>1,400 customers
~5K employees
SW4
Page 141 / 156
Product Managers Speed Product Innovation with Hadoop
Problem
Product managers needed to analyze server logs
● 130K clients drive 780M transactions per day
● Services incorporate streams from core CRM and 3rd-party platforms like Twitter, Facebook and YouTube
● Product managers need to capture and interpret server log data to analyze new feature adoption & performance (see the sketch after this slide)
● Unable to process the current volume using relational data stores
● Unable to retain enough data because of cost
Solution
HDP gives PMs power, reliability and liberty
● Power: analysis of more than 30TB per month
● Reliability: previous system broke every 2 weeks; no longer
● Liberty: open source solution prevents vendor lock-in
● HDP increases Product Management storage and analysis without a corresponding increase in IT spend
Creating Opportunity
Data: Server Log
Software
Sales & CRM software, cloud computing
~$3B in revenue
~10K employees
SW5
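As a minimal, hypothetical illustration of the kind of server-log analysis the product managers wanted (the log format and feature names are invented), the sketch below parses access-log lines and counts distinct adopters per feature:

```python
import re
from collections import defaultdict

# Invented log format: "<timestamp> user=<id> action=<feature>"
LOG_LINE = re.compile(r"user=(?P<user>\S+) action=(?P<feature>\S+)")

logs = [
    "2014-03-01T10:00:00 user=u1 action=bulk_export",
    "2014-03-01T10:05:00 user=u2 action=bulk_export",
    "2014-03-01T10:07:00 user=u1 action=inline_edit",
]

adopters = defaultdict(set)
for line in logs:
    m = LOG_LINE.search(line)
    if m:  # skip malformed lines instead of failing the whole job
        adopters[m.group("feature")].add(m.group("user"))

# Distinct users per feature -- a basic adoption metric after a release.
for feature, users in sorted(adopters.items()):
    print(f"{feature}: {len(users)} distinct users")
```

At 30TB of logs per month the same projection runs as a distributed job rather than a loop, but the parse-filter-aggregate shape is identical.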
Page 142 / 156
eCommerce Platform Uses Data Lake for Insight
Problem
New types of data difficult to store, unavailable for analysis
● Millions of payments processed every day
● Fraudsters sell fake items or extract buyer account info
● Some creditors default, resulting in losses
● Unable to store the current volume using relational data stores
● Unable to retain older data because of RDBMS storage cost
Solution
HDP data lake accelerates multiple analysis projects
● Platform stores all new types of data: clickstream, social, sensor, geolocation, server logs and unstructured data
● Detects and prevents theft: fraudsters stealing from members
● Assesses credit risk: server log analysis & machine learning
● Manages offers: aggregates data for advertisers
● User experience: social sentiment analysis on usability
● Site optimization: analyze clickstream for site improvements
Creating Opportunity
Data: Server Log
Software
eCommerce payments platform
~$6B in revenue
>130M users
~13K employees
SW6
Page 143 / 156
Offloading Clickstream Data from Netezza
Problem
EDW capacity consumed by exhaust data, clickstream unavailable
● Netezza EDW operating near capacity
● Netezza housing exhaust data not required for intended reporting and analytics, leading to unnecessary expense
● Enterprise IT maintained redundant data stores
● Unable to store clickstream data to enrich consumer intelligence
Solution
Longer storage, lower cost & better consumer intelligence
● Hadoop will recover premium Teradata cycles, currently used for transformations and data movement
● Projected cost savings of >$1M by offloading exhaust data
● Analysis of clickstream adds a new dimension to the customer view
● Improved service efficiency: bill processing & reporting
Improving Efficiency
Data: ETL & Clickstream
Telecom
Major telecom provider
~$25B in revenue
>40M customers
TC1
Page 144 / 156
Unified Household View of the Customer
Problem
Acquisitions & data explosion fragment view of customer
● Recent acquisitions and proliferation of types of data caused a fragmented view of customers
● Data exists across multiple applications & data stores
● Semi-structured data: social, sensors & networked devices
● Difficult to integrate structured, semi-structured & unstructured data sets from so many distinct sources
Solution
HDP data lake delivers 360° unified household view
● Stable environment for exploring and enriching the data
● Store all of the data and retain it for longer
● Parse on demand: no need to pre-parse data before loading
● Analysis on demand: allows analysts to explore raw data and find unexpected truths in the data
Creating Opportunity
Data: ETL, Social, Sensor & Clickstream
Telecom
Major telecom provider, offering data networks & services
>$100B in revenue
>200K employees
TC2
Page 145 / 156
Call Record Analysis for Improved Cell Service
Problem
System receives millions of call detail records per second
● System enables proactive management of phone call quality
● Call detail records (CDRs) are the raw data used for analysis
● Millions of CDRs stream in every second
● Storage is expensive & ingest rates are increasing 20% YoY
● 24-hour data retention not sufficient to discover long-term trends
Solution
Longer storage & rich analysis improve customer service
● HDP's 10:1 compression allows affordable 6-month retention (see the capacity arithmetic after this slide)
● Improved forensics on instances of poor call quality drive:
– Informed decisions on expansion of transmission infrastructure
– Predictive analytics on when to repair/replace equipment
● Access to more data helps service reps solve customer issues in near real-time
Creating Opportunity
Data: Sensor
Telecom
Major telecom provider, offering data networks & services
>$100B in revenue
>200K employees
TC3
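The jump from 24 hours to 6 months of retention is easier to see with the arithmetic written out. The ingest figure below is an assumption chosen only to make the effect of the 10:1 ratio concrete; the case study does not state a daily volume:

```python
# Assumed raw CDR ingest rate -- illustrative, not from the case study.
raw_tb_per_day = 10.0
compression = 10  # the 10:1 ratio cited on the slide

def storage_tb(days: int, ratio: int = 1) -> float:
    """Footprint for `days` of retention at the given compression ratio."""
    return raw_tb_per_day * days / ratio

before = storage_tb(1)                 # 24-hour retention, uncompressed
after = storage_tb(180, compression)   # ~6 months at 10:1 compression
print(f"before: {before:.0f} TB for 1 day")
print(f"after:  {after:.0f} TB for 180 days")
# 10 TB/day * 180 days / 10 = 180 TB: 180x the history for 18x the footprint.
```

That 10x gap between history gained and footprint paid is what makes the longer-term trend analysis on the slide affordable.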
Page 146 / 156
ETL: 100x the Data, 12x Longer, $3M Saved
Problem
Changing business model required new data architecture
● Started in the 1990s as a neutral intermediary for telco networks
● Network management market is mature
● CEO challenged the company to build a business for data analysis and information services related to telecom data
● Netezza data capacity limited to 20TB
● Only stored 1% of the total dataset, retained for only 60 days
Solution
More data, stored longer, with $3 million in cost savings
● Avoided $3M in annual expense, compared to Netezza
● Now storing 100% of data (100x the prior 1%), retained for two years (roughly 12x the prior 60 days)
● Larger data set supports new, accurate information products
● Improved access to data for more employees drives new innovation across the enterprise
Creating Opportunity
Data: ETL
Telecom
Telco information and analytics vendor
$800M in revenue
~2,500 employees
TC4
Page 147 / 156
Searchable ETL for CDRs & Customer Data
Problem
Data storage costs limit the amount and types of data available for analysis
● Teradata and Vertica used for data storage: ideal for certain data workloads, but unsuited for less structured types of data
● Limited retention of call detail records (CDRs)
● Limited analysis across call logs, CRM records & customer acquisition models
Solution
Data lake: ETL, data exploration & NPTB ("next product to buy") recommendations
● Partners Teradata, HP and Impetus helped craft a solution
● CDRs now retained for longer, improving visibility & analysis
● Customer retention data can be correlated to service quality
● Plan to integrate search for real-time NPTB recommendations
● Improved customer acquisition and retention
Creating Opportunity
Data: Structured, Server Log & Geolocation
Telecom
Telco vendor specializing in VoIP
>$800M in revenue
>2M subscribers
~1,000 employees
TC5
Page 148 / 156
Better Service to Premium Customers, for Less
Problem
Inability to identify base stations serving premier customers
● CRM system and network logs were in isolated data silos
● Company unable to analyze base station usage by premium customers, to prioritize investments
● Info gap prevented optimal ROI on infrastructure investments
Solution
HDP joins structured CRM & unstructured network data at scale
● Partnered with Datameer and HP to deliver a unified solution
● Joins network data on utilization of base stations with CRM data on the value of customers using those stations most often
● Optimizes service to the most valuable customers
● Efficient resource allocation reduces the overall cost to maintain network infrastructure
Improving Efficiency
Data: Structured & Server Log
Telecom
Major European telco
>$800M in revenue
>300M customers
>100K employees
TC6