Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
Big Data 101
Pradeep Varadan
Enterprise Architecture
Mar 2014
Agenda
• What is Big Data?
– Hype
– Facts
– Definition
• Why the upsurge?
– Rethinking data
– Rethinking processes
• Technology
– Current constraints
– Hadoop
– RDBMS vs. Hadoop
– NoSQL
• Use Cases
– Cross-industry examples
– Netflix
What is Big Data?
[Figure: the word "data" shown next to "Big Data"]
What is Big Data?
– Social media
– Server logs
– Web clickstream
– Machine/sensor
– Geo-location
Era: Hobbyist, Desktop, Internet, Big Data
Scale: KB, GB, PB, ZB
What is Big Data?
“high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” - Gartner
Rethinking data #1
Move from samples to populations
TRADITIONAL APPROACH: Analyze small subsets of information
BIG DATA APPROACH: Analyze all information
[Figure: in the traditional approach the analyzed information is a small part of all available information; in the Big Data approach all available information is analyzed]
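The sample-versus-population contrast can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data is synthetic, and the names `population`, `sample`, and the 120-second average session length are hypothetical.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Synthetic "population": 200,000 hypothetical session lengths in seconds.
population = [random.expovariate(1 / 120.0) for _ in range(200_000)]

# Traditional approach: analyze a small subset of the information.
sample = random.sample(population, 1_000)
sample_mean = statistics.fmean(sample)

# Big Data approach: analyze all available information.
population_mean = statistics.fmean(population)

print(f"sample mean:     {sample_mean:.2f}")
print(f"population mean: {population_mean:.2f}")
print(f"sampling error:  {abs(sample_mean - population_mean):.2f}")
```

The sample answer carries a sampling error that the full scan does not; the trade-off is that the full scan needs the storage and compute to touch every record.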
Rethinking data #2
Let data do the talking
TRADITIONAL APPROACH: Start with hypothesis and test against selected data (Hypothesis, Question, Data, Answer)
BIG DATA APPROACH: Explore all data and identify correlations (Data, Exploration, Correlation, Insight)
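As a rough illustration of "explore all data and identify correlations", the sketch below computes the Pearson correlation for every pair of columns and ranks the pairs, instead of testing one pre-selected hypothesis. The column names and values are made up for the example, and `pearson` and `explore` are hypothetical helper names.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def explore(columns):
    """Correlate every pair of columns and rank pairs by strength."""
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)

# Made-up per-user metrics, for illustration only.
columns = {
    "page_views": [3, 8, 1, 12, 7, 5],
    "minutes_on_site": [4, 11, 2, 15, 9, 6],
    "support_calls": [2, 0, 3, 0, 1, 1],
}

ranked = explore(columns)
for a, b, r in ranked:
    print(f"{a} vs {b}: r = {r:+.3f}")
```

A strong correlation found this way is a lead to investigate, not proof of cause.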
Rethinking processes #1
Fail fast or progress iteratively
TRADITIONAL APPROACH: Carefully cleanse information before any analysis (small amount of carefully organized information)
BIG DATA APPROACH: Analyze information as is, cleanse as needed (large amount of messy information)
Rethinking processes #2
Provide insight in real time
TRADITIONAL APPROACH: Analyze data after it’s been processed and landed in a warehouse or mart (Data, Repository, Analysis, Insight)
BIG DATA APPROACH: Analyze data in motion as it’s generated, in real time (Data, Analysis, Insight)
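Analyzing data in motion can be sketched with a Python generator that updates a result as each event arrives, rather than waiting for a batch to land in a repository. This is a toy illustration: the function name `rolling_average`, the window size, and the readings are assumptions, and a production system would use a stream processor rather than an in-memory list.

```python
from collections import deque

def rolling_average(events, window=3):
    """Yield the average of the last `window` readings as each event arrives."""
    recent = deque(maxlen=window)  # old readings fall off automatically
    for value in events:
        recent.append(value)
        yield sum(recent) / len(recent)

# Hypothetical sensor readings arriving one at a time.
readings = [10, 20, 30, 40, 50]
averages = list(rolling_average(readings, window=3))
print(averages)  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

Each insight is available as soon as its event is seen, and only the last few readings are kept in memory.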
Constraints of the current environment
Category | Existing optimization | Ask
Data type | Structured | Unstructured
H/W scalability | Vertical | Horizontal
Reliability | Pricey H/W | Free S/W
Interoperability | Closed by vendor | Open source
I/O | Write less, read more | Write more, read less
Insight | Newspaper/daily | Near real time
Data retention | Filtered/limited | Unfiltered/unlimited
Big Data Technologies
• Hadoop
• NoSQL
• Analytics/Visualization (out of scope)
How did Hadoop come about?
Year | Google
2004 | GFS, MapReduce
2005 | Sawzall
2006 | Bigtable
2010 | Dremel/F1
… | …
2012 | Spanner
Year | Open Source
2006 | HDFS
2008 | Pig, Hive
2008 | HBase
2013 | Impala
… | …
? | ?
HDFS: Distributed compute and storage
[Figure: a NameNode and a JobTracker coordinating a cluster of worker nodes, each running a DataNode (DN) and a TaskTracker (TT); the legend distinguishes the DFS message path from the MapReduce processing message path]
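The storage side of this picture can be approximated in a few lines of Python: a file is cut into fixed-size blocks, and each block is assigned to several DataNodes, with the mapping held as metadata by the NameNode. This is a simplified sketch under stated assumptions, not HDFS code: the function names are hypothetical, the 128 MB block size and replication factor of 3 are used as illustrative values, and the round-robin placement stands in for the rack-aware policy a real cluster uses.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs covering a file of `file_size` bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Simplification: real placement also considers racks and free space.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    return {
        block_id: [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
        for block_id in range(num_blocks)
    }

MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)
namenode_metadata = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])

for block_id, (offset, length) in enumerate(blocks):
    print(block_id, length // MB, "MB ->", namenode_metadata[block_id])
```

Because every block lives on several nodes, losing one commodity machine does not lose data, and compute tasks can be scheduled on a node that already holds the block.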
MapReduce: visual example
[Figure: a job flowing through the Distribute, Map, Shuffle, and Reduce phases]
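The phases named on this slide can be walked through with the classic word-count example. The sketch below runs in a single Python process purely to show the data flow; the function names and input splits are assumptions for illustration, and on a cluster each split would be mapped on a different node.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield (word, 1)

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(splits):
    # Distribute: each split would be handed to a different mapper.
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

splits = ["big data big insight", "data in motion", "big data"]
counts = word_count(splits)
print(counts)
```

Map and reduce are independent per split and per key, which is what lets the framework spread them across many machines.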
Hadoop Reference architecture
Welcome to the zoo!
hadoop - HDFS, MapReduce
sqoop - database to HDFS
flume - logs to HDFS
hbase - column-family store modeled on Bigtable; key/value access
pig - dataflow scripting (Pig Latin); can call out to Python, Ruby, PHP scripts
hive - SQL-like queries
oozie - workflow coordination; XML-based scheduler/job orchestration
zookeeper - coordinator; misc admin functions: locking, messaging, mailboxes, leader election
fuse-dfs - mount HDFS volumes in Linux
avro - data serialization/RPC
mahout - machine learning
dumbo - Python library for Hadoop streaming
vaidya - performance diagnostics for MapReduce jobs
chukwa - cluster monitoring
lucene - text search
scribe - log collection
storm - real-time processing
Hadoop companies
Interfaces to Hadoop
– Analytics
– Data prep
– CRM
Hadoop vs. Relational Databases
• Hadoop: schema-on-read (write first, think later)
• RDBMS: schema-on-write (think first, write next)
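The difference between the two approaches can be shown with a small Python sketch. It is illustrative only: `SCHEMA`, the field names, and the helper functions are hypothetical, and plain lists stand in for a relational table and for raw files in HDFS.

```python
import json

# Hypothetical schema for an events table; field names are illustrative only.
SCHEMA = {"user_id": int, "event": str}

def write_with_schema(table, record):
    """Schema-on-write: validate the record before it is stored."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise TypeError(f"{field} must be {expected_type.__name__}")
    table.append(record)

def write_raw(store, line):
    """Schema-on-read: store the raw line as is, with no validation."""
    store.append(line)

def read_with_schema(store, fields):
    """Apply a schema only at query time, projecting the requested fields."""
    for line in store:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# Think first, write next: a record that violates the schema is rejected.
table = []
write_with_schema(table, {"user_id": 1, "event": "login"})
try:
    write_with_schema(table, {"user_id": 2, "event": "click", "page": "/home"})
    rejected = False
except ValueError:
    rejected = True

# Write first, think later: differently shaped records land side by side.
store = []
write_raw(store, '{"user_id": 1, "event": "login"}')
write_raw(store, '{"user_id": 2, "event": "click", "page": "/home"}')
rows = list(read_with_schema(store, ["user_id", "page"]))
print(rejected, rows)
```

Schema-on-write pays the modeling cost up front and guarantees clean rows; schema-on-read accepts anything and pushes interpretation, including missing fields, to each query.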
NoSQL – Not Only SQL
NoSQL Types
• Column family:
– Aggregate/OLAP oriented; the primary key is the data, mapped back to row IDs
– HBase, Accumulo, Cassandra
– NSA uses Accumulo with cell-level security for PRISM
• Document store:
– Object-oriented encapsulation; encodings such as XML, YAML, JSON, and BSON
– MarkLogic, MongoDB, Couchbase
– MetLife uses MongoDB for “The Wall”, its customer 360 view CRM
• Key-value:
– (key, value) based lookups; associative array backed by a hash table
– Dynamo, Riak, Voldemort
– LinkedIn used Voldemort behind “Who viewed my profile?”
• Graph:
– Graph structures with nodes, edges, and properties; index-free adjacency
– Neo4j, AllegroGraph, Virtuoso
– TwitLogic: semantic web using Twitter data
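The four data models can be compared side by side using plain Python structures. This sketch only mimics the shape of the data; it does not use any of the products named above, and all keys, values, and the `neighbors` helper are made up for illustration.

```python
# Key-value: an opaque value looked up by key (associative array).
key_value = {"user:42": "Ada"}

# Document: a self-contained, nested record (JSON-like).
document = {"_id": 42, "name": "Ada", "addresses": [{"city": "London"}]}

# Column family: row key -> column family -> column -> value.
column_family = {
    "row42": {
        "profile": {"name": "Ada"},
        "activity": {"last_login": "2014-03-24"},
    }
}

# Graph: nodes with properties, plus labeled edges between them.
nodes = {"ada": {"kind": "person"}, "grace": {"kind": "person"}}
edges = [("ada", "KNOWS", "grace")]

def neighbors(node, label):
    """Follow outgoing edges with the given label from a node."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]

print(key_value["user:42"])
print(document["addresses"][0]["city"])
print(column_family["row42"]["activity"]["last_login"])
print(neighbors("ada", "KNOWS"))
```

Note that scanning an edge list, as `neighbors` does here, is the opposite of index-free adjacency; a graph database stores direct references from each node to its neighbors so that traversal does not need a lookup.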
SQL vs. NoSQL
SQL | NoSQL
Relational | Distributed/hierarchical
Tables | Key-value pairs, documents, graphs, column families
Pre-defined schema | Dynamic schema
Vertically scalable | Horizontally scalable
SQL | UnQL (more programming)
Complex queries on small data | Simple queries on large data
ACID | BASE
Defined data model | Model inside application
Cumbersome setup (DBA) | Ease of setup
Simple data | Complex data
Eventually consistent
“CAP Theorem is a set of basic requirements that describe any distributed system, not just storage or database”
“You cannot have a clustered system that supports all of the following three qualities: consistency, availability, partition tolerance” - CAP Theorem by Prof. Eric Brewer (Berkeley)
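Eventual consistency can be demonstrated with a toy simulation of two replicas. The sketch assumes a last-writer-wins rule keyed on a version number, which is only one of several reconciliation strategies; the `Replica` class and `anti_entropy` function are hypothetical names, not the API of any real datastore.

```python
class Replica:
    """One copy of the data; stores key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        # Last-writer-wins: keep the value with the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

def anti_entropy(replicas):
    """Exchange state so every replica ends up with the newest versions."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)

a, b = Replica("a"), Replica("b")

# During a network partition only replica "a" sees the write.
a.write("profile", "v1", version=1)
stale = b.read("profile")      # "b" stays available but returns stale data

# Once the partition heals, the replicas converge.
anti_entropy([a, b])
converged = b.read("profile")
print(stale, converged)
```

Replica "b" chose availability over consistency while partitioned: it answered the read, but with out-of-date state, and caught up afterwards.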
Big Data Use Cases
Use Cases
“House of Cards” is one of the first major test cases of this Big Data-driven creative strategy. Detailed knowledge of Netflix subscriber viewing preferences clinched their decision to license a remake of the popular and critically well regarded 1990 BBC miniseries. Netflix’s data indicated that the same subscribers who loved the original BBC production also gobbled down movies starring Kevin Spacey or directed by David Fincher. Therefore, concluded Netflix executives, a remake of the BBC drama with Spacey and Fincher attached was a no-brainer, to the point that the company committed $100 million for two 13-episode seasons.
Where are we headed?
• H/W
– COUCH: cluster of unreliable commodity hardware
– Software defined storage reliability
• S/W
– HDFS will be the new UNIX (distributed FS)
– Open Source software
• Data Ingestion
– Online transactions + Batch file + Streaming torrents
• Technical Architecture
– Shared nothing
– Data centric (Process will move to data)
• Backup and recovery ?
• Scalability
– Horizontal
– Vertical
• Mixed workloads
References
• McKinsey
• Gartner
• Forrester
• Wikibon
• IBM big data
• Oracle Big Data
• Aster
• MapR
• Cloudera
• Wikipedia Big Data
• Wikipedia NO SQL
• MongoDB
• Use Cases
Thank you
pradeepvaradan Pradeep.Varadan@Verizon.com

More Related Content

Similar to 20140324 big data_101_v7

How Verizon Uses Disruptive Developments for Organized Progress
How Verizon Uses Disruptive Developments for Organized ProgressHow Verizon Uses Disruptive Developments for Organized Progress
How Verizon Uses Disruptive Developments for Organized ProgressMongoDB
 
Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat...
Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat...Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat...
Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat...DataWorks Summit
 
Harnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyHarnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyDataWorks Summit
 
Complex Analytics using Open Source Technologies
Complex Analytics using Open Source TechnologiesComplex Analytics using Open Source Technologies
Complex Analytics using Open Source TechnologiesDataWorks Summit
 
Monitoring and troubleshooting spring boot microservices arch in production o...
Monitoring and troubleshooting spring boot microservices arch in production o...Monitoring and troubleshooting spring boot microservices arch in production o...
Monitoring and troubleshooting spring boot microservices arch in production o...VMware Tanzu
 
How to Deal with Constant Change by Verizon Product Manager
How to Deal with Constant Change by Verizon Product ManagerHow to Deal with Constant Change by Verizon Product Manager
How to Deal with Constant Change by Verizon Product ManagerProduct School
 
Transforming Your Revenue Engine: How Verizon uses AI and Data to Accelerate ...
Transforming Your Revenue Engine: How Verizon uses AI and Data to Accelerate ...Transforming Your Revenue Engine: How Verizon uses AI and Data to Accelerate ...
Transforming Your Revenue Engine: How Verizon uses AI and Data to Accelerate ...Lattice Engines
 
Mobile technology andy brady - chicago tour
Mobile technology   andy brady - chicago tour Mobile technology   andy brady - chicago tour
Mobile technology andy brady - chicago tour Ramon Ray
 
VerizonFinalPresentation_TomCruz
VerizonFinalPresentation_TomCruzVerizonFinalPresentation_TomCruz
VerizonFinalPresentation_TomCruzTom Cruz
 
DWS16 - Connected things forum - David Vasquez, Verizon Enterprise Solutions
DWS16 - Connected things forum - David Vasquez, Verizon Enterprise SolutionsDWS16 - Connected things forum - David Vasquez, Verizon Enterprise Solutions
DWS16 - Connected things forum - David Vasquez, Verizon Enterprise SolutionsIDATE DigiWorld
 
Gaining Support for Hadoop in a Large Corporate Environment
Gaining Support for Hadoop in a Large Corporate EnvironmentGaining Support for Hadoop in a Large Corporate Environment
Gaining Support for Hadoop in a Large Corporate EnvironmentDataWorks Summit
 
Verizon NAB Show Media Cloud Ecosystem April 6, 2015 Final Scott Spector
Verizon NAB Show Media Cloud Ecosystem April 6, 2015 Final Scott SpectorVerizon NAB Show Media Cloud Ecosystem April 6, 2015 Final Scott Spector
Verizon NAB Show Media Cloud Ecosystem April 6, 2015 Final Scott SpectorScott Spector
 
Webinar: Improving Time to Value for Enterprise Big Data Analytics
Webinar: Improving Time to Value for Enterprise Big Data AnalyticsWebinar: Improving Time to Value for Enterprise Big Data Analytics
Webinar: Improving Time to Value for Enterprise Big Data AnalyticsStorage Switzerland
 
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014MapR Technologies
 
Forecast of Big Data Trends
Forecast of Big Data TrendsForecast of Big Data Trends
Forecast of Big Data TrendsIMC Institute
 
Take the Big Data Challenge - Take Advantage of ALL of Your Data 16 Sept 2014
Take the Big Data Challenge - Take Advantage of ALL of Your Data 16 Sept 2014Take the Big Data Challenge - Take Advantage of ALL of Your Data 16 Sept 2014
Take the Big Data Challenge - Take Advantage of ALL of Your Data 16 Sept 2014pietvz
 
Big Data in Action – Real-World Solution Showcase
 Big Data in Action – Real-World Solution Showcase Big Data in Action – Real-World Solution Showcase
Big Data in Action – Real-World Solution ShowcaseInside Analysis
 

Similar to 20140324 big data_101_v7 (20)

How Verizon Uses Disruptive Developments for Organized Progress
How Verizon Uses Disruptive Developments for Organized ProgressHow Verizon Uses Disruptive Developments for Organized Progress
How Verizon Uses Disruptive Developments for Organized Progress
 
Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat...
Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat...Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat...
Verizon: Finance Data Lake implementation as a Self Service Discovery Big Dat...
 
Harnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyHarnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case Study
 
Complex Analytics using Open Source Technologies
Complex Analytics using Open Source TechnologiesComplex Analytics using Open Source Technologies
Complex Analytics using Open Source Technologies
 
Monitoring and troubleshooting spring boot microservices arch in production o...
Monitoring and troubleshooting spring boot microservices arch in production o...Monitoring and troubleshooting spring boot microservices arch in production o...
Monitoring and troubleshooting spring boot microservices arch in production o...
 
How to Deal with Constant Change by Verizon Product Manager
How to Deal with Constant Change by Verizon Product ManagerHow to Deal with Constant Change by Verizon Product Manager
How to Deal with Constant Change by Verizon Product Manager
 
Transforming Your Revenue Engine: How Verizon uses AI and Data to Accelerate ...
Transforming Your Revenue Engine: How Verizon uses AI and Data to Accelerate ...Transforming Your Revenue Engine: How Verizon uses AI and Data to Accelerate ...
Transforming Your Revenue Engine: How Verizon uses AI and Data to Accelerate ...
 
Mobile technology andy brady - chicago tour
Mobile technology   andy brady - chicago tour Mobile technology   andy brady - chicago tour
Mobile technology andy brady - chicago tour
 
VerizonFinalPresentation_TomCruz
VerizonFinalPresentation_TomCruzVerizonFinalPresentation_TomCruz
VerizonFinalPresentation_TomCruz
 
DWS16 - Connected things forum - David Vasquez, Verizon Enterprise Solutions
DWS16 - Connected things forum - David Vasquez, Verizon Enterprise SolutionsDWS16 - Connected things forum - David Vasquez, Verizon Enterprise Solutions
DWS16 - Connected things forum - David Vasquez, Verizon Enterprise Solutions
 
Gaining Support for Hadoop in a Large Corporate Environment
Gaining Support for Hadoop in a Large Corporate EnvironmentGaining Support for Hadoop in a Large Corporate Environment
Gaining Support for Hadoop in a Large Corporate Environment
 
Verizon NAB Show Media Cloud Ecosystem April 6, 2015 Final Scott Spector
Verizon NAB Show Media Cloud Ecosystem April 6, 2015 Final Scott SpectorVerizon NAB Show Media Cloud Ecosystem April 6, 2015 Final Scott Spector
Verizon NAB Show Media Cloud Ecosystem April 6, 2015 Final Scott Spector
 
1415 gold sanford
1415 gold sanford1415 gold sanford
1415 gold sanford
 
xGem BigData
xGem BigDataxGem BigData
xGem BigData
 
Webinar: Improving Time to Value for Enterprise Big Data Analytics
Webinar: Improving Time to Value for Enterprise Big Data AnalyticsWebinar: Improving Time to Value for Enterprise Big Data Analytics
Webinar: Improving Time to Value for Enterprise Big Data Analytics
 
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
Fast and Furious: From POC to an Enterprise Big Data Stack in 2014
 
Forecast of Big Data Trends
Forecast of Big Data TrendsForecast of Big Data Trends
Forecast of Big Data Trends
 
Take the Big Data Challenge - Take Advantage of ALL of Your Data 16 Sept 2014
Take the Big Data Challenge - Take Advantage of ALL of Your Data 16 Sept 2014Take the Big Data Challenge - Take Advantage of ALL of Your Data 16 Sept 2014
Take the Big Data Challenge - Take Advantage of ALL of Your Data 16 Sept 2014
 
Verizon January 8, 2014
Verizon   January 8, 2014Verizon   January 8, 2014
Verizon January 8, 2014
 
Big Data in Action – Real-World Solution Showcase
 Big Data in Action – Real-World Solution Showcase Big Data in Action – Real-World Solution Showcase
Big Data in Action – Real-World Solution Showcase
 

Recently uploaded

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
20140324 big data_101_v7

  • 1. Big Data 101. Pradeep Varadan, Enterprise Architecture, Mar 2014. Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.
  • 2. Agenda: What is Big Data? (Hype, Facts, Definition); Why the upsurge? (Rethinking data, Rethinking processes); Technology (Current constraints, RDBMS vs. Hadoop, Hadoop, NoSQL); Use Cases (Cross-industry examples, Netflix)
  • 3. What is Big Data? [Graphic labels: data, Big Data]
  • 4. What is Big Data? Sources: social media, server logs, web clickstream, machine/sensor, geo-location. Scale by era: Hobbyist (KB), Desktop (GB), Internet (PB), Big Data (ZB).
  • 5. What is Big Data? “high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” - Gartner
  • 6. Agenda: What is Big Data? (Hype, Facts, Definition); Why the upsurge? (Rethinking data, Rethinking processes); Technology (Current constraints, RDBMS vs. Hadoop, Hadoop, NoSQL); Use Cases (Cross-industry examples, Netflix)
  • 7. Rethinking data #1: Move from samples to populations. Traditional approach: analyze small subsets of information. Big Data approach: analyze all information. [Diagram labels: analyzed information within all available information; all available information analyzed]
  • 8. Rethinking data #2: Let data do the talking. Traditional approach: start with a hypothesis and test it against selected data. Big Data approach: explore all data and identify correlations. [Diagram labels: Hypothesis, Question, Data, Answer; Data, Exploration, Correlation, Insight]
  • 9. Rethinking processes #1: Fail fast or progress iteratively. Traditional approach: carefully cleanse information before any analysis (small amount of carefully organized information). Big Data approach: analyze information as is, cleanse as needed (large amount of messy information).
  • 10. Rethinking processes #2: Provide insight in real time. Traditional approach: analyze data after it has been processed and landed in a warehouse or mart. Big Data approach: analyze data in motion as it is generated, in real time. [Diagram labels: Data, Repository, Analysis, Insight; Data, Analysis, Insight]
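The contrast on slide 10 can be sketched in a few lines of Python. This is only an illustration, not something from the deck: the function names and the sample readings are hypothetical. The batch function waits until everything has landed in a repository, while the streaming function updates its answer as each event arrives.

```python
from typing import Iterable, Iterator, List


def batch_average(repository: List[float]) -> float:
    """Traditional approach: analyze only after all data has landed."""
    return sum(repository) / len(repository)


def streaming_average(events: Iterable[float]) -> Iterator[float]:
    """Data-in-motion approach: emit an updated insight after every event."""
    count = 0
    mean = 0.0
    for value in events:
        count += 1
        mean += (value - mean) / count  # incremental mean, no stored history
        yield mean


# Hypothetical sensor readings, used only for illustration.
readings = [120.0, 80.0, 100.0, 140.0]
running = list(streaming_average(readings))
final_batch = batch_average(readings)
```

Both approaches agree once all the data has been seen; the difference is that the streaming version has a usable answer after the first event.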
  • 11. Agenda: What is Big Data? (Hype, Facts, Definition); Why the upsurge? (Rethinking data, Rethinking processes); Technology (Current constraints, Hadoop, RDBMS vs. Hadoop, NoSQL); Use Cases (Cross-industry examples, Netflix)
  • 12. Constraints of the current environment (category: existing vs. optimization ask)
      – Data type: structured vs. unstructured
      – H/W scalability: vertical vs. horizontal
      – Reliability: pricey H/W vs. free S/W
      – Interoperability: closed by vendor vs. open source
      – IO: write less, read more vs. write more, read less
      – Insight: newspaper/daily vs. near real time
      – Data retention: filtered/limited vs. unfiltered/unlimited
  • 13. Big Data Technologies: Hadoop; NoSQL; Analytics/Visualization (out of scope)
  • 14. How did Hadoop come about?
      – Google: 2004 GFS, MapReduce; 2005 Sawzall; 2006 Bigtable; 2010 Dremel/F1; …; 2012 Spanner
      – Open source: 2006 HDFS; 2008 Pig, Hive; 2008 HBase; 2013 Impala; …; ?
  • 15. HDFS: Distributed compute and storage. [Diagram: a Name Node and a Job Tracker coordinating worker nodes, each running a Data Node (DN) and a Task Tracker (TT); legend: DFS message path, MapReduce processing messages]
  • 16. MapReduce: visual example. [Diagram stages: Distribute, Map, Shuffle, Reduce]
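The stages named on slide 16 can be imitated in plain Python with a word count, the usual teaching example. This is a single-process sketch and not Hadoop code: the function names and input strings are assumptions made for illustration, and a real cluster would run the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: turn one input split into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share a key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's values into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits: List[str]) -> Dict[str, int]:
    # Distribute: each split would go to a different worker on a cluster.
    mapped = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle_phase(mapped))


counts = word_count(["big data big insight", "data in motion"])
```

Here `counts` maps each distinct word to its total, for example `big` to 2 and `motion` to 1.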
  • 17. Hadoop reference architecture
  • 18. Welcome to the zoo!
      – hadoop: HDFS, MapReduce
      – sqoop: DB to HDFS
      – flume: log to HDFS
      – hbase: columnar store, Bigtable, key/value
      – pig: python, ruby, php
      – hive: SQL query
      – oozie: workflow coordination, XML based, scheduler/job orchestration
      – zookeeper: coordinator; misc admin functions: locking, messaging, mailboxes, leader election
      – fuse-dfs: HDFS volumes in Linux
      – avro: data serialization/RPC
      – mahout: machine learning
      – dumbo: Python library for streaming
      – vaidya: performance benchmarking framework
      – chukwa: cluster monitor
      – Lucene: text search
      – scribe: log collection
      – storm: real-time processing
  • 19. Hadoop companies
  • 20. Interfaces to Hadoop: Analytics, Data Prep, CRM
  • 21. Hadoop vs. relational databases. Hadoop: schema-on-read (write first, think later). RDBMS: schema-on-write (think first, write next).
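The schema-on-write versus schema-on-read distinction on slide 21 can be illustrated with Python's standard library. This is a sketch under stated assumptions: the `clicks` table, the field names and the sample records are all hypothetical, with an in-memory SQLite database standing in for an RDBMS and a list of raw JSON lines standing in for files landed in HDFS.

```python
import json
import sqlite3

# Schema-on-write (RDBMS): the structure is declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id INTEGER NOT NULL, url TEXT NOT NULL)")
conn.execute("INSERT INTO clicks VALUES (?, ?)", (1, "/home"))


def rejected_on_write() -> bool:
    """A row that breaks the declared schema is refused at write time."""
    try:
        conn.execute("INSERT INTO clicks VALUES (?, ?)", (2, None))
    except sqlite3.IntegrityError:
        return True
    return False


# Schema-on-read (Hadoop style): raw records are stored as-is, and structure
# is applied only when a job reads them.
raw_lines = [
    '{"user_id": 1, "url": "/home"}',
    '{"user_id": 2, "referrer": "search"}',  # different shape, still stored
]


def read_urls(lines):
    """Interpret each record at read time, tolerating missing fields."""
    for line in lines:
        record = json.loads(line)
        yield record.get("url", "<unknown>")


urls = list(read_urls(raw_lines))
```

The trade-off is where the cost lands: the relational table rejects bad data early, while the raw store accepts everything and pushes interpretation onto each reader.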
  • 22. NoSQL – Not Only SQL
  • 23. NoSQL types
      • Column family:
          – Aggregate, OLAP oriented; primary key is data mapping back to row IDs
          – HBase, Accumulo, Cassandra
          – NSA uses Accumulo with cell-level security for PRISM
      • Document store:
          – Object-oriented encapsulation; encoding (XML, YAML, JSON, and BSON)
          – MarkLogic, MongoDB, Couchbase
          – MetLife uses MongoDB for “The Wall”/Customer 360 View CRM
      • Key-value:
          – (key, value) based lookups; associative array with hash table
          – Dynamo, Riak, Voldemort
          – LinkedIn used Voldemort behind “Who viewed my profile?”
      • Graph:
          – Graph structures with nodes, edges, and properties; index-free adjacency
          – Neo4j, Allegro, Virtuoso
          – TwitLogic semantic web using Twitter data
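The four data models on slide 23 differ mainly in the shape of a stored record. The sketch below mimics each shape with plain Python structures. Every key, name and value is made up for illustration, and none of this reflects the actual API of the products listed.

```python
# Key-value: an opaque value looked up by its key (an associative array).
kv_store = {"profile:42:views": 17}

# Document: a self-describing, nested record, typically encoded as JSON/BSON.
document = {
    "_id": 42,
    "name": "Ada",
    "policies": [{"type": "life", "status": "active"}],
}

# Column family: row key -> column family -> column -> value.
column_family = {
    "row-42": {
        "contact": {"email": "ada@example.com"},
        "usage": {"2014-03": 310},
    }
}

# Graph: nodes carry properties and point directly at their neighbours,
# so a traversal follows references instead of consulting a global index.
graph = {
    "ada": {"props": {"kind": "user"}, "edges": [("FOLLOWS", "bob")]},
    "bob": {"props": {"kind": "user"}, "edges": []},
}


def neighbors(g, node, label):
    """Return the nodes reachable from `node` over edges with this label."""
    return [target for (edge_label, target) in g[node]["edges"] if edge_label == label]


followed = neighbors(graph, "ada", "FOLLOWS")
```

The same customer could be stored in any of these shapes; the choice depends on whether the dominant access pattern is a lookup, a whole-record fetch, a column scan, or a traversal.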
  • 24. SQL vs. NoSQL
      – Relational vs. distributed/hierarchical
      – Tables vs. key-value pairs, documents, graphs, column families
      – Pre-defined schema vs. dynamic schema
      – Vertically scalable vs. horizontally scalable
      – SQL vs. UnQL (more programming)
      – Complex queries on small data vs. simple queries on large data
      – ACID vs. BASE
      – Defined data model vs. model inside application
      – Cumbersome setup (DBA) vs. ease of setup
      – Simple data vs. complex data
  • 25. Eventually consistent. “CAP Theorem is a set of basic requirements that describe any distributed system not just storage or database” “You cannot have a clustered system that supports all of the following three qualities: consistency, availability, partition-tolerant” - CAP Theorem by Prof. Eric Brewer (Berkeley)
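A toy model can make "eventually consistent" on slide 25 concrete. The sketch below is an assumption-laden illustration and not how any particular NoSQL product works: the `Replica` class, the version numbers and the `anti_entropy` sync are hypothetical, and conflicts are resolved with a simple highest-version-wins rule. During a partition one replica keeps accepting writes (availability), the other serves stale data, and the two converge once they can talk again.

```python
from typing import Dict, Optional, Tuple


class Replica:
    """One copy of the data; each key stores a (version, value) pair."""

    def __init__(self) -> None:
        self.data: Dict[str, Tuple[int, str]] = {}

    def write(self, key: str, value: str, version: int) -> None:
        current = self.data.get(key)
        if current is None or version > current[0]:  # highest version wins
            self.data[key] = (version, value)

    def read(self, key: str) -> Optional[str]:
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(a: Replica, b: Replica) -> None:
    """Exchange state in both directions so the replicas converge."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


east, west = Replica(), Replica()
east.write("plan", "basic", version=1)
anti_entropy(east, west)

# Network partition: only `east` sees the update, but both stay available.
east.write("plan", "premium", version=2)
stale = west.read("plan")

# Partition heals: the replicas sync and agree again.
anti_entropy(east, west)
fresh = west.read("plan")
```

Between the update and the sync, `west` answers with the old value; that window of staleness is the price paid for staying available during the partition.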
  • 26. Agenda: What is Big Data? (Hype, Facts, Definition); Why the upsurge? (Rethinking data, Rethinking processes); Technology (Current constraints, Hadoop, RDBMS vs. Hadoop, NoSQL); Use Cases (Cross-industry examples, Netflix)
  • 27. Big Data Use Cases
  • 28. Use Cases: “House of Cards” is one of the first major test cases of this Big Data-driven creative strategy. Detailed knowledge of Netflix subscriber viewing preferences clinched their decision to license a remake of the popular and critically well regarded 1990 BBC miniseries. Netflix’s data indicated that the same subscribers who loved the original BBC production also gobbled down movies starring Kevin Spacey or directed by David Fincher. Therefore, concluded Netflix executives, a remake of the BBC drama with Spacey and Fincher attached was a no-brainer, to the point that the company committed $100 million for two 13-episode seasons.
  • 29. Where are we headed?
      – H/W: Couch (cluster of unreliable commodity hardware); software-defined storage reliability
      – S/W: HDFS will be the new UNIX (distributed FS); open source software
      – Data ingestion: online transactions + batch file + streaming torrents
      – Technical architecture: shared nothing; data centric (process will move to data)
      – Backup and recovery?
      – Scalability: horizontal; vertical
      – Mixed workloads
  • 30. References: McKinsey; Gartner; Forrester; Wikibon; IBM big data; Oracle Big Data; Aster; MapR; Cloudera; Wikipedia Big Data; Wikipedia NoSQL; MongoDB; Use Cases
  • 31. Thank you. pradeepvaradan, Pradeep.Varadan@Verizon.com