"We tend to overestimate the effect of a technology in the short run
and underestimate the effect in the long run.“-Amara’s Law.
“The Best Way to Predict the Future is to Create it”. – Peter Drucker.
Author – Karthik Padmanabhan
Deputy Manager, Research and Advanced Engineering,
Ford Motor Company.
Image source: isa.org
What is Big Data
Data is not termed Big Data on size alone; several factors are involved.
1. Volume – Corporate data has grown to the petabyte level and is still growing.
2. Velocity – Data changes every moment; its history needs to be tracked and its
utilization planned.
3. Variety – Data comes from a wide variety of sources and in various formats.
Many now add a fourth aspect, Veracity. Essentially, the dimensionality of the term
Big Data is increasing by the day. Eventually the curse of dimensionality comes into
play, and normal methods can no longer extract real insights from the data.
Innovation is the key to handling such a beast.
Dimensions of Big Data
Image source: datasciencecentral.com
Image source: vmware.com
Life Cycle – From Start to End
Image source: doubleclix.wordpress.com
Need for Big Data
Brutal fact: 80% of data is unstructured and growing at roughly 15% annually, and data
volumes are expected to double within the next two years.
Single View: Integrated analysis of customer and transaction data.
E-Commerce: Storing huge amounts of click-stream data; the entire digital footprint
needs to be measured.
Text Processing applications: Social media text mining. Here the entire landscape
changes, since it involves a different set of metrics at higher dimensions, which
increases the complexity of the application. Distributed processing is the way to go.
Real-Time Actionability: Immediate feedback on a product launch through analysis of
social media comments, instead of waiting for a customer satisfaction survey.
Change in Consumer Psychology: The necessity for instant gratification.
Hype Cycles - Gartner
The 2013 Gartner Hype Cycle Special Report evaluates the maturity of over 2,000 technologies
and trends in 102 areas. New Hype Cycles this year feature content and social analytics,
embedded software and systems, consumer market research, open banking, banking operations
innovation, and ICT in Africa.
http://www.gartner.com/technology/research/hype-cycles/
Big Data – Where it is used
• Work-Force Science
• Astronomy (Hubble telescope)
• Gene and DNA expressions.
• Identifying cancerous cells that cause disease.
• Fraud detection
• Video and Audio Mining
• Automotive Industry – We will see use cases in the
next slide.
• Consumer focused marketing
• Retail
Automotive Industry
“If the automobile had followed the same
development cycle as the computer, a Rolls-
Royce would today cost $100, get a million miles
per gallon, and explode once a year, killing
everyone inside.”
– Robert X. Cringely
Use Cases - Automotive
• Vehicle Insurance
• Personalized Travel & Shopping Guidance
Systems
• Supply Chain/Logistics
• Auto-Repairs
• Vehicle Engineering
• Vehicle Warranty
• Customer Sentiment
• Customer Care Call Centers
Use Cases - Automotive
• Self-Driving Cars (Perspective)
– Sensor data generates 1 GB of data per second.
– A person drives 600 hours per year on average.
– 2,160,000 (600*60*60) seconds → about 2 petabytes of data per car per year.
– The total number of cars in the world is set to surpass 1 billion.
– So you can do the math (see the quick calculation after this list).
• Smart Parking using Mesh Networks – visuals in the next slide.
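As a minimal back-of-the-envelope sketch in Python, using only the slide's own figures (600 driving hours per year, 1 GB of sensor data per second, ~1 billion cars):

# Per the slide's own figures:
seconds_per_year = 600 * 60 * 60            # 600 hours of driving per year
gb_per_car_per_year = seconds_per_year * 1  # 1 GB of sensor data per second
pb_per_car_per_year = gb_per_car_per_year / 1_000_000   # 1 PB = 1,000,000 GB (decimal units)

print(f"{gb_per_car_per_year:,} GB = {pb_per_car_per_year:.2f} PB per car per year")
# Multiply by roughly 1 billion cars and the total is on the order of
# two billion petabytes of sensor data per year.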
Smart parking – Mesh Networks
Datacenter- Where Data Resides
• “eBay Keynote – 90M Users, 245M items on
sale, 93B active database calls…that’s a busy
DATACENTER” - Gartner DC
Google Datacenter – Over the years
Welcome to the Era of CLOUD
"More firms will adopt Amazon EC2 or EMR or
Google App Engine platforms for data analytics.
Put in a credit card, buy an hour's or a month's worth
of compute and storage. Charge for what
you use. No sign-up period or fee. Ability to fire
up complex analytic systems. Can be a small or
large player." – Ravi Kalakota's forecast
Big Data on the cloud
Image source: practicalanalytics.files.wordpress.com
What Intelligent Means
• “A pair of eyes attached to a human brain can
quickly make sense of the content presented
on a web page and decide whether it has the
answer it’s looking for or not in ways that a
computer can't. Until now.”
― David Amerland, Google Semantic Search
Intelligent Web
Humor Corner
Image source: thebigdatainsightsgroup.com
Big Data Technologies &
Complexity
1. Hadoop Framework – HDFS and MapReduce
2. Hadoop Ecosystem
3. NoSQL Databases
In Big Data, choosing algorithms with the least complexity in terms of processing time is
most important. We usually use Big O notation to assess such complexity.
Big O notation expresses the rate at which the performance of a system degrades as a
function of the amount of data it is asked to handle.
For example, for a sorting operation we should prefer merge sort (with time
complexity O(N log N)) over insertion sort (O(N^2)); a small comparison sketch follows.
How do we find the Big O for any given polynomial?
The steps are:
• Drop constants
• Drop coefficients
• Keep only the highest-order term; its exponent gives the complexity.
For example, 3n^3 degrades cubically, i.e. O(n^3).
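A minimal sketch (not from the slides) that makes these growth rates concrete: both sorts are instrumented to count element comparisons, so doubling the input size roughly quadruples the insertion-sort count while the merge-sort count only slightly more than doubles.

import random

def insertion_sort(a):
    # Return (sorted copy, number of element comparisons); O(N^2) on average.
    a, comps = list(a), 0
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0:
            comps += 1
            if a[j] <= key:
                break
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a, comps

def merge_sort(a):
    # Return (sorted copy, number of element comparisons); O(N log N).
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, cl = merge_sort(a[:mid])
    right, cr = merge_sort(a[mid:])
    merged, comps, i, j = [], cl + cr, 0, 0
    while i < len(left) and j < len(right):
        comps += 1
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged, comps

for n in (1_000, 2_000, 4_000):
    data = [random.random() for _ in range(n)]
    _, c_ins = insertion_sort(data)
    _, c_mrg = merge_sort(data)
    # Doubling n roughly quadruples the insertion-sort comparisons (N^2),
    # but only slightly more than doubles the merge-sort comparisons (N log N).
    print(f"n={n}: insertion sort = {c_ins}, merge sort = {c_mrg}")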
Big Data Challenge
Challenge: CAP Theorem
CAP stands for Consistency, Availability and Partition tolerance. These are three
important properties in the big data space. However, the theorem states that we can
get only two of the three. Because availability and partition tolerance are critical in
the big data world, we are usually forced to relax consistency.
Availability: if you can talk to a node in the cluster, it can read and write data.
Partition Tolerance: the cluster can survive communication breakages.
Cluster Based approach
Single large CPU (supercomputer) – failure is a possibility, higher cost, vertical scaling.
Multiple nodes (commodity hardware) – fault tolerance through replication, lower cost,
horizontal scaling (sharding).
There are two variants of the cluster-based approach:
parallel computing and distributed computing.
Parallel computing has multiple CPUs with a shared memory, while distributed computing
has multiple CPUs with one memory per node.
When choosing algorithms for concurrent processing we consider the following factors
(a small sketch of the last one follows this list):
• Granularity: the number of tasks into which the job is decomposed.
There are two types, fine grained and coarse grained.
Fine grained: a large number of small tasks.
Coarse grained: a small number of large tasks.
• Degree of concurrency: the higher the average degree of concurrency, the better,
because the cluster is utilized properly.
• Critical path length: the longest directed path in the task dependency graph.
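As a minimal sketch (the task names and dependencies are invented for illustration), the critical path length of a task dependency graph can be computed by recursively following each task's dependencies and memoizing the results:

from functools import lru_cache

# Hypothetical task dependency graph: task -> tasks it depends on.
deps = {
    "load":      [],
    "clean":     ["load"],
    "aggregate": ["clean"],
    "train":     ["clean"],
    "report":    ["aggregate", "train"],
}

@lru_cache(maxsize=None)
def chain_length(task):
    # Length, in tasks, of the longest dependency chain ending at `task`.
    if not deps[task]:
        return 1
    return 1 + max(chain_length(d) for d in deps[task])

critical_path_length = max(chain_length(t) for t in deps)
print(critical_path_length)   # 4: load -> clean -> aggregate (or train) -> report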
RDBMS vs NoSQL
An RDBMS suffers from the impedance mismatch problem, and the database is
integration-oriented. It is not designed to run efficiently on clusters. It is normalized
with a well-defined schema, which makes it less flexible in adapting to the newer
requirement of processing large data in less time.
The natural movement shifted from integration databases to application-oriented
databases integrated through services.
NoSQL emerged with polyglot persistence, a schema-less design, and good
suitability for clusters.
Now the database stack looks like this:
RDBMS, key-value, document, column-family stores and graph
databases. We choose among these based on our requirements.
RDBMS – data stored as tuples (a limited data structure)
NoSQL – aggregates (complex data structures); a small sketch of the contrast follows.
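A minimal sketch (the order and its fields are invented for illustration) of the same data modelled as flat relational tuples and as a single NoSQL-style aggregate:

import json

# RDBMS view: normalized tuples spread across two tables.
orders      = [(1001, "2013-11-02", "cust-42")]             # (order_id, order_date, customer_id)
order_lines = [(1001, "prod-7", 2, 450.0),                  # (order_id, product_id, qty, price)
               (1001, "prod-9", 1, 120.0)]

# NoSQL view: one self-contained aggregate, stored and retrieved as a unit.
order_aggregate = {
    "order_id": 1001,
    "order_date": "2013-11-02",
    "customer_id": "cust-42",
    "line_items": [
        {"product_id": "prod-7", "qty": 2, "price": 450.0},
        {"product_id": "prod-9", "qty": 1, "price": 120.0},
    ],
}

print(json.dumps(order_aggregate, indent=2))

The aggregate maps naturally onto a cluster: everything about one order lives together, so it can be stored on, and fetched from, a single node.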
NoSQL Family
• Key-value databases – Voldemort, Redis, Riak, to name a few.
The aggregate is opaque to the database: just a big blob of mostly
meaningless bits. We access the aggregate by a lookup on some key.
• Document databases – CouchDB and MongoDB.
The aggregate has some internal structure.
• Column-family stores – HBase, Cassandra, Amazon SimpleDB.
Two-level aggregate structure: rows and columns, with columns organized into
column families. The row key is the row identifier; a column key and column value
together form a column within a family. Examples of column families are a customer's
profile and the orders placed by a customer. Each cell also carries a timestamp.
(A small sketch of this structure follows.)
• Graph databases – Neo4j, FlockDB, InfiniteGraph.
Suitable for modelling complex relationships; this family is not aggregate-oriented.
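A minimal sketch (row key, families and values are invented) of the two-level column-family structure using plain dictionaries, with each cell holding a value and a timestamp:

import time

# row key -> column family -> column key -> (value, timestamp)
store = {
    "cust-42": {
        "profile": {                                    # column family 1
            "name": ("Alice", time.time()),
            "city": ("Chennai", time.time()),
        },
        "orders": {                                     # column family 2
            "order-1001": ("2013-11-02", time.time()),
        },
    },
}

# A read addresses a cell by row key, column family and column key.
value, timestamp = store["cust-42"]["profile"]["name"]
print(value, timestamp)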
Why Hadoop
Yahoo
– Before Hadoop: $1 million for 10 TB of storage
– With Hadoop: $1 million for 1 PB of storage
Other Large Company
– Before Hadoop: $5 million to store data in Oracle
– With Hadoop: $240k to store the data in HDFS
Facebook
– Hadoop as unified storage
Case study: Netflix
Before Hadoop
– Nightly processing of logs
– Imported into a database
– Analysis/BI
As data volume grew, it took more than 24 hours to process and
load a day’s worth of logs
Today, an hourly Hadoop job processes the logs for quicker
availability of the data for analysis/BI
Currently ingesting approx. 1TB/day
Hadoop Stack Diagram
Hardware – Commodity Cluster
Software Environment – MapReduce and HDFS
Application – Ecosystem | Custom Applications
Core Components - HDFS
HDFS – Data files are split into blocks of 64 or 128 MB and distributed across multiple
nodes in the cluster. No random writes or reads are allowed in HDFS.
Each Map task operates on one HDFS data block; the mapper reads data as key-value pairs.
Five daemons: NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.
• DataNodes send heartbeats to the NameNode over TCP every 3 seconds.
• Every 10th heartbeat is a block report.
• The NameNode builds its metadata from block reports.
• If the NameNode is down, HDFS is down.
HDFS takes care of load balancing. (A small sketch of the block arithmetic follows.)
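A minimal sketch of the block arithmetic (the 1 GB file size is an assumed example; the block size and 3x replication figures come from these slides):

import math

file_size_mb  = 1_000     # hypothetical 1 GB file
block_size_mb = 128       # HDFS block size (64 or 128 MB per the slide)
replication   = 3         # default replication factor

blocks      = math.ceil(file_size_mb / block_size_mb)
raw_storage = file_size_mb * replication

print(f"{blocks} blocks, roughly {raw_storage} MB of raw storage across the cluster")
# -> 8 blocks, roughly 3000 MB of raw storage across the cluster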
Core Component - MapReduce
The MapReduce pattern allows computations to be parallelized over a cluster.
Example of MapReduce (sketched in code after this slide):
Map stage:
Suppose we have orders as aggregates, and the sales people want a product-level
report with total revenue for the last week.
The order aggregate is the input to the Map, which emits key-value pairs for the
corresponding line items: the key is the product id and (quantity, price) are the values.
A Map operation works on only a single record at a time and can therefore be
parallelized.
Reduce stage:
Takes multiple map outputs with the same key and combines their values.
The number of mappers is decided by the block and data size; the number of reducers
is decided by the programmer.
Widely used MapReduce computations can be stored as materialized views.
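A minimal in-process sketch of the map and reduce steps described above (the orders, products and prices are invented; this is plain Python, not Hadoop API code):

from collections import defaultdict

orders = [
    {"order_id": 1, "line_items": [{"product_id": "p1", "qty": 2, "price": 10.0},
                                   {"product_id": "p2", "qty": 1, "price": 25.0}]},
    {"order_id": 2, "line_items": [{"product_id": "p1", "qty": 1, "price": 10.0}]},
]

def map_order(order):
    # Emit (product_id, revenue) pairs; each order is processed independently,
    # which is what makes the map stage parallelizable.
    for item in order["line_items"]:
        yield item["product_id"], item["qty"] * item["price"]

def reduce_revenue(pairs):
    # Combine all values that share a key into a per-product total.
    totals = defaultdict(float)
    for product_id, revenue in pairs:
        totals[product_id] += revenue
    return dict(totals)

mapped = (pair for order in orders for pair in map_order(order))
print(reduce_revenue(mapped))   # {'p1': 30.0, 'p2': 25.0}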
MapReduce Limitations
• Computations that depend on previously computed values.
Ex: the Fibonacci series
• Algorithms that depend on shared global state.
Ex: Monte Carlo simulation
• Join algorithms for log processing.
The MapReduce framework is cumbersome for joins.
Fault Tolerance and Speed
• One server may stay up for 3 years (~1,000 days); if you have 1,000 servers, expect
to lose about one per day.
• So we replicate the data. Given the high volume and velocity it is impractical to
move the data around, so we bring the computation to the data, not vice versa. This
also minimizes network congestion.
• Since nodes fail, we use a distributed file system: 64 MB blocks, 3x replication,
storage across different racks.
• To increase processing speed: speculative execution.
Ecosystems 1 & 2 - Hive and Pig
Hive:
• An SQL-like interface to Hadoop.
• An abstraction over MapReduce, which is complex to write directly; it is just a view.
• It only provides a tabular view and does not create any tables.
• Hive Query Language (HQL) is parsed into tokens and converted into MapReduce jobs
internally before being executed on the Hadoop cluster.
Pig:
• A dataflow language for transforming large data sets.
• Pig scripts reside on the user's machine and the job executes on the cluster,
whereas Hive resides inside the Hadoop cluster.
• Easy syntax with constructs such as LOAD, FILTER, GROUP BY, FOREACH, STORE, etc.
Ecosystem 3 - HBase
HBase is a column-family store layered on top of HDFS.
It provides a column-oriented view of the data sitting in HDFS.
Can store massive amounts of data
– Multiple Terabytes, up to Petabytes of data
High Write Throughput
– Scales up to millions of writes per second
Copes well with sparse data
– Tables can have many thousands of columns
– Even if a given row only has data in a few of the columns
Use HBase if…
– You need random write, random read, or both (but not neither)
– You need to do many thousands of operations per second on
multiple TB of data
Used at Twitter, Facebook, etc.
Ecosystem 4 - Mahout
A machine learning library on top of Hadoop that is scalable and
efficient. Useful for predictive analytics: deriving meaningful
insights from current and historical data (mainly large data sets).
The implementations include:
Recommendation systems – implemented using techniques such as
collaborative filtering (a tiny sketch follows).
Classification – supervised learning using techniques such as
decision trees, k-NN, etc.
Clustering – unsupervised learning using techniques such as
k-means together with distance-based metrics.
Frequent itemset mining – finding patterns in customer purchases
and evaluating the correlations with respect to support, confidence and lift.
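A minimal sketch of the collaborative filtering idea mentioned above (toy in-memory data, plain Python rather than Mahout's own API): unseen items are scored for a user from the cosine similarity between item rating vectors.

import math
from collections import defaultdict

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "u1": {"A": 5, "B": 3, "C": 4},
    "u2": {"A": 4, "B": 1},
    "u3": {"B": 2, "C": 5},
}

def cosine(v1, v2):
    # Cosine similarity between two sparse rating vectors (dicts).
    common = set(v1) & set(v2)
    if not common:
        return 0.0
    dot = sum(v1[k] * v2[k] for k in common)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2)

# Build item -> {user: rating} vectors.
item_vectors = defaultdict(dict)
for user, items in ratings.items():
    for item, r in items.items():
        item_vectors[item][user] = r

def recommend(user):
    # Score each unseen item by similarity to the items the user has rated.
    seen = ratings[user]
    scores = {}
    for item in item_vectors:
        if item in seen:
            continue
        scores[item] = sum(cosine(item_vectors[item], item_vectors[s]) * r
                           for s, r in seen.items())
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("u2"))   # ['C'] - the only item u2 has not rated yet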
Other Ecosystems
Oozie
Specifies workflows when there is a complex map reduce job.
Zookeeper
Co-ordination among multiple servers with multiple machines sitting at multiple locations.
Maintains configuration information, distributed synchronization etc.,
Flume
Cheap storage of all the log files from all the servers.
Chukwa
Scalable log collector collecting the logs and dumping to HDFS. No storage or processing done.
Sqoop
Imports structured data into HDFS and also exports back the results from HDFS to RDBMS.
Whirr
A set of libraries for running cloud services. Today, configuration is specific to a provider; we can use
Whirr if we need to port to a different service provider, for example from Amazon S3 to Rackspace.
Hama and BSP
An alternative to MapReduce, aimed specifically at graph-processing applications. BSP (Bulk
Synchronous Parallel) is a parallel programming model.
Closing Note
Big Data is a reality now, and firms are sitting on huge amounts of data waiting to be processed
and mined so that meaningful insights can be extracted to generate revenue for the business.
This gives companies a competitive advantage in serving customers through the
entire life cycle (acquisition, retention, relationship enhancement, etc.), transforming
prospects into customers and customers into brand ambassadors.
Innovation in many fields, such as computer science, statistics, machine learning,
programming, management thinking and psychology, has come together to address the
current challenges of the Big Data industry.
To quote a few advances on the technology front:
reduction in computation cost, increase in computation power, decrease in the
cost of storage and the availability of commodity hardware.
New roles such as data stewards and data scientists are also emerging in the industry to
tackle these challenges and come up with appropriate solutions.
This is an active field of research and will continue to evolve in the years ahead.