2. Introduction
• Big Data is data that is hard to capture, store, and analyze with
commonly used software tools because of its very large size
• “World’s nervous system—a real-time feedback loop which didn’t
exist before” - Yahoo CEO Marissa Mayer
3. Why should you care?
• Mobile devices, smart energy meters, remote sensing,
wireless sensors, software machine logs, cameras, RFID
readers, etc. are creating massive amounts of data
that businesses & governments now have the
opportunity to analyze and act upon.
• Every day, approximately 2.5 quintillion (2.5×10^18) bytes of
data are created.
• The business and economic possibilities of big data and its
wider implications are important issues that business
leaders and policy makers will tackle in the years
ahead.
4. Industry verticals using Big Data
Any industry vertical that accumulates a sufficient quantity of data can leverage
Big Data technologies. Some of these verticals are:
• Digital Media & E-Commerce: Real-time ad targeting, Web analytics & trends
• Energy and Utilities: Smart meter analytics, Asset management
• Financial Services: Risk and fraud management, Portfolio management, Customer analytics
• Government: Threat management, Law enforcement (Real-time multimodal surveillance, Cyber security detection), Macroeconomic analytics
• Healthcare and Life Sciences: New drug development, Medical record text analytics, Genomic analytics
• Retail: CRM, Targeted marketing analysis, Vendor delivery & Supply chain optimizations, Market basket analysis, Click-stream analysis
• Telecommunications: CRM, Call detail record analysis, Least cost routing, Fraud management
• Transportation: Logistics optimization, Traffic congestion
6. Big Data Process/Steps
Data processing can be broken into three basic stages:
data (raw indicators), information (the meaningful
interpretation of those signals), and insight (an
actionable piece of knowledge).
7. Why Web Analytics quickly leads to Big Data Science
• Consider 10 million page views a day on a popular
web site
• Capture the user id for every page view and store it as a
32-bit integer
• 10 million x 4 bytes = 40 MB of storage/day
• 40 MB x 30 days = 1.2 GB/month
• Data grows quickly, and so do the challenges around
storage, processing, and analytics.
[Figure: 10^7 elements from the domain of 32-bit integers, 40 MB/day]
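The storage arithmetic for the page-view example can be checked with a short Python sketch. The function name and the decimal unit convention (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) are assumptions made for illustration; with binary units the monthly figure comes out slightly lower.

```python
# Back-of-the-envelope storage estimate for the page-view example.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 GB = 10**9 bytes).

BYTES_PER_USER_ID = 4  # one 32-bit integer per page view


def storage_bytes(page_views_per_day: int, days: int = 1) -> int:
    """Return the bytes needed to store one user id per page view."""
    return page_views_per_day * BYTES_PER_USER_ID * days


daily = storage_bytes(10_000_000)
monthly = storage_bytes(10_000_000, days=30)

print(f"per day:   {daily / 10**6:.0f} MB")
print(f"per month: {monthly / 10**9:.1f} GB")
```

This counts only the user id column. Any extra field captured per page view (timestamp, URL, and so on) multiplies the total accordingly.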
8. New algorithmic techniques in traditional computing
• Probabilistic data structures
• Cardinality estimation, frequency estimation, range queries,
membership queries, etc.
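As one illustration of a probabilistic data structure for membership queries, here is a minimal Bloom filter sketch in Python. The slides do not name a specific structure, so the choice of a Bloom filter, the class name, the bit-array size, and the number of hash functions are all assumptions for illustration. It trades exactness for memory: an added item is always reported present, while an item that was never added is occasionally reported present as well.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
for user_id in range(1000):
    seen.add(str(user_id))

print("42" in seen)      # True: added items are always reported present
print("user-x" in seen)  # usually False; a false positive is possible
```

The filter above uses about 8 KB regardless of how long the stored strings are, which is the point of the technique: memory is fixed up front and the error rate rises as more items are added.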
Distributed computing / Divide and conquer
• Break the work into equal parts, process each part independently, and
aggregate the individual results
• Distributed systems are complex to build and maintain
• Compute capacity had to be rented from academia & research labs
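The divide-and-conquer pattern described above can be sketched on a single machine. In this hypothetical example, worker threads stand in for separate compute nodes, and the function names and the page-view data are assumptions for illustration: the input is split into near-equal chunks, each chunk yields a partial count, and the partial counts are aggregated.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(items: list, parts: int) -> list:
    """Break the input into at most `parts` chunks of near-equal size."""
    size = max(1, -(-len(items) // parts))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_views(chunk: list) -> Counter:
    """Partial result: page views per user id within one chunk."""
    return Counter(chunk)


def total_views(user_ids: list, workers: int = 4) -> Counter:
    """Process chunks independently, then aggregate the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_views, split(user_ids, workers)))
    total = Counter()
    for partial in partials:
        total += partial
    return total


page_views = [1, 2, 1, 3, 2, 1, 4, 1]  # user id per page view
print(total_views(page_views))
```

Counting works here because the partial results can be merged in any order. Computations whose parts depend on each other do not split this cleanly.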
Dealing with large datasets
9. Traditional Distributed system challenges
Data exchange requires synchronization
Temporal dependencies are complicated
Difficult to deal with partial failures of the system
Data is mostly copied to the compute nodes at compute time
Developers spend more time designing for failure than they do actually
working on the problem itself
Transferring data to compute nodes becomes a bottleneck
• Typical disk data transfer rate: 75 MB/sec. Time taken to
transfer 100 GB of data to the processor: approx. 22 minutes.
New approach is needed
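The transfer-time figure quoted above follows from a one-line calculation, sketched here in Python. The function name and the decimal unit convention (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) are assumptions for illustration.

```python
# Time to move a dataset to the processor at a fixed sequential transfer rate.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 GB = 10**9 bytes).

MB = 10**6
GB = 10**9


def transfer_minutes(data_bytes: float, rate_bytes_per_sec: float) -> float:
    """Return the transfer time in minutes for the given size and rate."""
    return data_bytes / rate_bytes_per_sec / 60


minutes = transfer_minutes(100 * GB, 75 * MB)
print(f"{minutes:.1f} minutes")  # prints "22.2 minutes"
```

The time grows linearly with the data size, so at the same rate 1 TB would take roughly ten times as long.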
10. Ideal system for distributed computing
Partial failure support
Data recoverability
Component recoverability
Consistency
Scalability