2013 International Conference on Knowledge, Innovation and Enterprise Presen...
Big data business case
1. "We tend to overestimate the effect of a technology in the short run
and underestimate the effect in the long run." – Amara's Law
"The best way to predict the future is to create it." – Peter Drucker
Author – Karthik Padmanabhan
Deputy Manager, Research and Advanced Engineering,
Ford Motor Company.
Image source: isa.org
2. What is Big Data
Data is not termed Big Data based on size alone; there are multiple
factors to it.
1. Volume – Corporate data has grown to the petabyte level and is still growing…
2. Velocity – Data changes every moment, and its history needs to be tracked
while planning for its utilization.
3. Variety – Data comes from a wide variety of sources and in various formats.
People now include one more aspect, called Veracity. Essentially, the
dimensionality of the term Big Data is increasing by the day.
Eventually the curse of dimensionality comes into play, and normal methods
can no longer extract real insights from the data.
Innovation is the key to handling such a beast.
6. Life Cycle – From Start to End
Image source: doubleclix.wordpress.com
7. Need for Big Data
Brutal fact – 80% of data is unstructured, and it is growing at 15% annually; within
the next two years, data size will double from today's.
Single view: integrated analysis of customer and transaction data.
E-commerce business: storing huge amounts of click-stream data; the entire
digital footprint needs to be measured.
Text-processing applications: social media text mining. Here the entire landscape
changes, as it involves a different set of metrics at higher dimensions, which increases
the complexity of the application. Distributed processing is the way to go.
Real-time actionability: immediate feedback on a product launch through analysis of
social media comments, instead of waiting for a customer satisfaction survey.
Change in consumer psychology: the necessity for instant gratification.
8. Hype Cycles – Gartner
The 2013 Gartner Hype Cycle Special Report evaluates the maturity of over 2,000 technologies
and trends in 102 areas. New Hype Cycles this year feature content and social analytics,
embedded software and systems, consumer market research, open banking, banking operations
innovation, and ICT in Africa.
http://www.gartner.com/technology/research/hype-cycles/
9. Big Data – Where it is used
• Work-Force Science
• Astronomy (Hubble telescope)
• Gene and DNA expression
• Identifying cancerous cells that cause disease
• Fraud detection
• Video and Audio Mining
• Automotive Industry – We will see use cases in the
next slide.
• Consumer focused marketing
• Retail
10. Automotive Industry
“If the automobile had followed the same
development cycle as the computer, a Rolls-
Royce would today cost $100, get a million miles
per gallon, and explode once a year, killing
everyone inside.”
– Robert X. Cringely
12. Use Cases - Automotive
• Self-driving cars (perspective)
Sensor data generates 1 GB of data per second.
A person drives 600 hours per year on average.
600 * 60 * 60 = 2,160,000 seconds, i.e., about 2 petabytes of data per car
per year.
The total number of cars in the world is set to surpass 1 billion.
So you can do the math.
• Smart Parking using Mesh Networks – Visuals in
the next slide.
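The arithmetic behind the self-driving-car bullet can be checked with a short script (all figures are the slide's estimates, not measurements):

```python
# Back-of-the-envelope check of the slide's figures (assumed estimates).
GB_PER_SECOND = 1            # sensor data rate per car, per the slide
HOURS_PER_YEAR = 600         # average driving time per car per year
SECONDS_PER_YEAR = HOURS_PER_YEAR * 60 * 60   # 2,160,000 seconds

gb_per_car_year = GB_PER_SECOND * SECONDS_PER_YEAR   # 2,160,000 GB
pb_per_car_year = gb_per_car_year / 1_000_000        # ~2.16 PB per car per year

CARS_WORLDWIDE = 1_000_000_000                       # slide's projection
total_pb = pb_per_car_year * CARS_WORLDWIDE          # fleet-wide total

print(f"{pb_per_car_year:.2f} PB per car per year")
print(f"{total_pb:,.0f} PB across all cars")
```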
16. Welcome to the Era of CLOUD
"More firms will adopt Amazon EC2 or EMR or
Google App Engine platforms for data analytics.
Put in a credit card, buy an hour's or a month's worth
of compute and storage. Charge for what
you use. No sign-up period or fee. Ability to fire
up complex analytic systems. Can be a small or
large player." – Ravi Kalakota's forecast
17. Big Data on the cloud
Image source:practicalanalytics.files.wordpress.com
20. What Intelligent Means
• “A pair of eyes attached to a human brain can
quickly make sense of the content presented
on a web page and decide whether it has the
answer it’s looking for or not in ways that a
computer can't. Until now.”
― David Amerland, Google Semantic Search
23. Big Data Technologies & Complexity
1. Hadoop framework – HDFS and MapReduce
2. Hadoop ecosystem
3. NoSQL databases
In Big Data, choosing algorithms with the least complexity in terms of processing time is
most important. We usually use Big O notation for assessing such complexities.
Big O notation describes the rate at which the performance of a system degrades as a
function of the amount of data it is asked to handle.
For example, for a sorting operation we should prefer merge sort (with time
complexity O(N log N)) over insertion sort (O(N^2)).
How do you find the Big O for any given polynomial?
The steps are:
• Drop constants
• Drop coefficients
• Keep only the highest-order term; its exponent is the complexity.
For example, 3n^3 has cubic degradation.
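The three steps above can be sketched as a tiny helper function (a toy illustration, not a library API; the polynomial is represented as hypothetical (coefficient, exponent) pairs):

```python
def big_o(terms):
    """Given a polynomial as (coefficient, exponent) pairs, return its Big O
    class: drop constants (exponent 0), ignore coefficients, and keep only
    the highest-order term."""
    exponents = [exp for _, exp in terms if exp > 0]
    if not exponents:
        return "O(1)"            # only constant terms remain
    highest = max(exponents)     # highest-order term wins
    return "O(n)" if highest == 1 else f"O(n^{highest})"

# 3n^3 + 10n + 42 -> cubic degradation
print(big_o([(3, 3), (10, 1), (42, 0)]))   # O(n^3)
```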
24. Big Data Challenge
Challenge: the CAP theorem
CAP stands for Consistency, Availability, and Partition tolerance. These three are
important properties in the Big Data space. However, the theorem states that we can
get only two of the three. We are therefore forced to relax the consistency aspect,
because availability and partition tolerance are critical to the Big Data world.
Availability: if you can talk to a node in the cluster, it can read and write data.
Partition tolerance: the cluster can survive communication breakages.
25. Cluster Based approach
Single large CPU (supercomputer) – failure is a possibility, more cost, vertical scaling.
Multiple nodes (commodity hardware) – fault tolerance through replication, less cost,
horizontal scaling (sharding).
There are also two variants of the cluster-based approach:
parallel computing and distributed computing.
Parallel computing has multiple CPUs with a shared memory; distributed computing
has multiple CPUs with one memory per node.
When choosing algorithms for concurrent processing, we consider the following factors:
• Granularity: the number of tasks into which a job is divided.
It has two types: fine-grained and coarse-grained.
Fine-grained: a large number of small tasks.
Coarse-grained: a small number of large tasks.
• Degree of concurrency: the higher the average degree of concurrency, the better,
because the clusters are properly utilized.
• Critical path length: the longest directed path in a task dependency graph.
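The critical path length can be computed as the longest path through the task dependency DAG; a minimal sketch with made-up task names and durations:

```python
from functools import lru_cache

# Task dependency graph: edges point from a task to the tasks that depend
# on it. Names and durations are illustrative, not from the slides.
deps = {"load": ["clean"], "clean": ["map1", "map2"],
        "map1": ["reduce"], "map2": ["reduce"], "reduce": []}
duration = {"load": 2, "clean": 3, "map1": 5, "map2": 1, "reduce": 4}

@lru_cache(maxsize=None)
def longest_path(task):
    """Length of the longest directed path starting at `task`,
    counting each task's own duration."""
    nexts = deps[task]
    return duration[task] + (max(longest_path(t) for t in nexts) if nexts else 0)

critical_path_length = max(longest_path(t) for t in deps)
print(critical_path_length)   # load + clean + map1 + reduce = 14
```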
26. RDBMS vs NoSQL
An RDBMS suffers from impedance-mismatch problems, and the database is
integrated. It is not designed to run efficiently on clusters. It is normalized
with a well-defined schema. This makes it less flexible in adapting to newer
requirements of processing large data in less time.
The natural movement shifted integration databases toward application-oriented
databases integrated through services.
NoSQL emerged with polyglot persistence, a schema-less design, and
good suitability for clusters.
The database stack now looks like this:
RDBMS, key-value, document, column-family stores, and graph
databases. We need to choose among these based on our requirements.
RDBMS – data stored in tuples (a limited data structure)
NoSQL – aggregates (complex data structures)
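The tuple-versus-aggregate contrast can be shown with a toy order (all field names are invented for illustration):

```python
# Relational view: the order is normalized into flat tuples across tables.
orders     = [(1, "2013-09-01")]                 # (order_id, date)
line_items = [(1, "P100", 2, 9.99),              # (order_id, product, qty, price)
              (1, "P200", 1, 24.50)]

# Aggregate view (document store): the whole order travels as one unit,
# so a cluster can shard and replicate it without cross-table joins.
order_doc = {
    "order_id": 1,
    "date": "2013-09-01",
    "line_items": [
        {"product": "P100", "qty": 2, "price": 9.99},
        {"product": "P200", "qty": 1, "price": 24.50},
    ],
}

# Reassembling the relational version needs a join; the aggregate does not.
joined = [(o, li) for o in orders for li in line_items if li[0] == o[0]]
print(len(joined), len(order_doc["line_items"]))
```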
27. NoSQL Family
Key-value databases – Voldemort, Redis, Riak, to name a few.
The aggregate is opaque to the database: just some big blob of mostly
meaningless bits. We access the aggregate by lookup on a key.
Document databases – CouchDB and MongoDB.
The aggregate has some structure.
Column-family stores – HBase, Cassandra, Amazon SimpleDB.
A two-level aggregate structure: rows and columns, with columns organized
into column families. The row key is the row identifier; a column key and
column value together form a column. Examples of column families are a
customer's profile and the orders placed by a customer. Within each cell there is a timestamp.
Graph databases – Neo4j, FlockDB, InfiniteGraph.
Suitable for modeling complex relationships; this family is not aggregate-oriented.
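The two-level column-family structure can be mimicked with nested dictionaries (the row key, families, and columns here are invented examples):

```python
# Row key -> column family -> column key -> value.
# A nested dict mimics the two-level aggregate structure of a
# column-family store (timestamps omitted for brevity).
store = {
    "customer:42": {                                            # row key
        "profile": {"name": "Ann", "city": "Detroit"},          # family 1
        "orders":  {"2013-09-01": "P100", "2013-09-05": "P200"} # family 2
    }
}

# Lookup descends: row key, then column family, then column key.
print(store["customer:42"]["profile"]["name"])
print(sorted(store["customer:42"]["orders"]))
```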
28. Why Hadoop
Yahoo
– Before Hadoop: $1 million for 10 TB of storage
– With Hadoop: $1 million for 1 PB of storage
Other large company
– Before Hadoop: $5 million to store the data in Oracle
– With Hadoop: $240k to store the data in HDFS
Facebook
– Hadoop as unified storage
Case study: Netflix
Before Hadoop
– Nightly processing of logs
– Imported into a database
– Analysis/BI
As data volume grew, it took more than 24 hours to process and
load a day’s worth of logs
Today, an hourly Hadoop job processes logs for quicker
availability to the data for analysis/BI
Currently ingesting approx. 1TB/day
30. Core Components – HDFS
HDFS – data files are split into blocks of 64 or 128 MB and distributed across multiple
nodes in the cluster. No random writes or reads are allowed in HDFS.
Each map operates on one HDFS data block; the mapper reads data as key-value pairs.
Five daemons – NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker.
• DataNodes send heartbeats over TCP every 3 seconds
• Every 10th heartbeat is a block report
• The NameNode builds its metadata from block reports
• If the NameNode is down, HDFS is down
HDFS takes care of load balancing.
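The block split plus replication translates into simple storage arithmetic; a sketch assuming the 64 MB block size and the default 3x replication discussed in these slides:

```python
import math

BLOCK_MB = 64          # HDFS block size assumed from the slide
REPLICATION = 3        # default HDFS replication factor

def hdfs_footprint(file_mb):
    """Blocks a file is split into, and the raw storage it occupies
    cluster-wide after replication (HDFS blocks are not padded, so a
    final partial block only occupies its actual size)."""
    blocks = math.ceil(file_mb / BLOCK_MB)
    return blocks, file_mb * REPLICATION

blocks, raw_mb = hdfs_footprint(1000)   # a ~1 GB file
print(blocks, raw_mb)                   # 16 blocks, 3000 MB raw
```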
31. Core Component – MapReduce
The MapReduce pattern allows computations to be parallelized over a cluster.
Example of MapReduce:
Map stage:
Suppose we have orders as aggregates, and salespeople want to see a
product-level report with total revenue for the last week.
The order aggregate is the input to the map, which emits key-value pairs for the
corresponding line items: the product ID is the key, and (quantity, price) is the value.
A map operation works on only a single record at a time and hence can be
parallelized.
Reduce stage:
Takes multiple map outputs with the same key and combines their values.
The number of mappers is decided by the block and data size; the number of
reducers is decided by the programmer.
Widely used MapReduce computations can be stored as materialized views.
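The product-revenue example can be sketched as plain Python functions standing in for the map and reduce stages (the order-aggregate shape is assumed for illustration):

```python
from collections import defaultdict

def map_order(order):
    """Map stage: one order aggregate in, (product_id, (qty, price)) pairs
    out. Operates on a single record, so many mappers can run in parallel."""
    for item in order["line_items"]:
        yield item["product_id"], (item["qty"], item["price"])

def reduce_revenue(product_id, values):
    """Reduce stage: combine all values sharing a key into total revenue."""
    return product_id, sum(qty * price for qty, price in values)

orders = [
    {"line_items": [{"product_id": "P1", "qty": 2, "price": 10.0},
                    {"product_id": "P2", "qty": 1, "price": 5.0}]},
    {"line_items": [{"product_id": "P1", "qty": 1, "price": 10.0}]},
]

# Shuffle: group map outputs by key before reducing.
grouped = defaultdict(list)
for order in orders:
    for key, value in map_order(order):
        grouped[key].append(value)

report = dict(reduce_revenue(k, v) for k, v in grouped.items())
print(report)   # {'P1': 30.0, 'P2': 5.0}
```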
32. MapReduce Limitations
• Computations that depend on previously computed values.
Example: the Fibonacci series.
• Algorithms that depend on shared global state.
Example: Monte Carlo simulation.
• Join algorithms for log processing.
The MapReduce framework is cumbersome for joins.
33. Fault Tolerance and Speed
• One server may stay up 3 years (~1,000 days). If you have 1,000 servers, expect
to lose one per day.
• So go for replication of data. Given the high volume and velocity, it is impossible to
move the data, so bring the computation to the data and not vice versa. This also
minimizes network congestion.
• Since nodes fail, we go for a distributed file system: 64 MB blocks, 3x replication,
storage across different racks.
• To increase processing speed: speculative execution.
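The failure expectation in the first bullet is simple arithmetic:

```python
# If one server survives ~1,000 days on average, a 1,000-server cluster
# sees roughly one failure per day (figures are the slide's estimates).
mean_uptime_days = 1000
servers = 1000
expected_failures_per_day = servers / mean_uptime_days
print(expected_failures_per_day)   # 1.0
```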
34. Ecosystem 1 & 2 – Hive and Pig
Hive:
• An SQL-like interface to Hadoop.
• An abstraction over MapReduce, which is complex. This is just a view.
• It provides only a tabular view and doesn't create any tables.
• Hive Query Language (HQL) is converted to a set of tokens, which in turn get
converted to MapReduce jobs internally before being executed on the
Hadoop cluster.
Pig:
• A dataflow language for transforming large data sets.
• Pig scripts reside on the user's machine and the job executes on the cluster,
whereas Hive resides inside the Hadoop cluster.
• Easy syntax with constructs such as LOAD, FILTER, GROUP BY, FOREACH, STORE, etc.
35. Ecosystem 3 – HBase
HBase is a column-family store database layered on top of HDFS.
It provides a column-oriented view of the data sitting on HDFS.
It can store massive amounts of data
– multiple terabytes, up to petabytes.
High write throughput
– scales up to millions of writes per second.
Copes well with sparse data
– tables can have many thousands of columns,
even if a given row has data in only a few of them.
Use HBase if…
– you need random write, random read, or both (but not neither);
– you need to do many thousands of operations per second on
multiple TB of data.
Used at Twitter, Facebook, etc.
36. Ecosystem 4 – Mahout
A machine-learning library on top of Hadoop that is scalable and
efficient. It is useful for predictive analysis: deriving meaningful
insights from current and historical data (mainly large datasets).
The implementations include:
Recommendation systems – implementing recommendations
using techniques such as collaborative filtering.
Classification – supervised learning using techniques such as
decision trees, KNN, etc.
Clustering – a type of unsupervised learning using techniques
such as k-means, along with distance-based metrics.
Frequent itemset mining – finding patterns in customer
purchases and then the correlation of purchases with
respect to support, confidence, and lift.
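Mahout itself runs these algorithms as Java MapReduce jobs on Hadoop; a single-machine k-means sketch in plain Python illustrates the clustering idea (the 1-D data points and k are made up):

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Empty clusters keep their old centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious 1-D clusters, around 1 and around 100.
data = [0.9, 1.0, 1.1, 99.0, 100.0, 101.0]
print(kmeans(data, k=2))   # centroids near [1.0, 100.0]
```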
37. Other Ecosystems
Oozie
Specifies workflows for complex MapReduce jobs.
ZooKeeper
Coordination among multiple servers, with multiple machines sitting at multiple locations.
Maintains configuration information, distributed synchronization, etc.
Flume
Cheap storage of all the log files from all the servers.
Chukwa
A scalable log collector that gathers logs and dumps them to HDFS. No storage or processing is done.
Sqoop
Imports structured data into HDFS and exports results from HDFS back to an RDBMS.
Whirr
A set of libraries for running cloud services. Today, configuration is specific to a provider; we can use
Whirr if we need to port to a different service provider, for example from Amazon S3 to Rackspace.
Hama and BSP
An alternative to MapReduce, specifically for graph-processing applications. BSP is a parallel
programming model.
38. Closing Note
Big Data is a reality now, and firms are sitting on huge chunks of data to be processed
and mined to extract meaningful insights that generate revenue for the business.
This will give companies a competitive advantage in serving customers through
the entire life cycle (acquisition, retention, relationship enhancement, etc.), transforming
prospects into customers and customers into brand ambassadors.
Innovation in many fields, such as computer science, statistics, machine learning,
programming, management thinking, and psychology, has come together to address this
current challenge of the Big Data industry.
To quote a few advances in the technology field:
reduction of computation cost, increase in computation power, decrease in the
cost of storage, and availability of commodity hardware.
New roles such as data stewards and data scientists are also emerging in this industry to
take on the challenges and come up with appropriate solutions.
This is an active field of research and will continue to evolve over the coming years.