4. What is the aim of the course?
Focus is on “Systems” and applications for cloud-based
storage and processing of BIG DATA.
+Big Data - Definition
+Big Data - Analytics
+Big Data - Storage (HDFS)
+Big Data - Computing (Map/Reduce)
+Big Data - Database (HBase)
+Big Data - Graph DB (Titan)
+Big Data - Streaming (Storm)
5. Get convinced about “Big Data”
• Understand why we need a different paradigm.
• Ascertain with confidence the need to look at data computing in a different
way.
• Realize the potential of big data
– All of you are skilled enough to get into it.
• What we will not do
– Do research on why things have evolved into the current trends as they stand.
– Try to be hands-on – But not guaranteed
7. What are we going to understand?
• What is Big Data?
• Why did we land up here?
• To whom does it matter?
• Where is the money?
• Are we ready to handle it?
• What are the concerns?
• Tools and Technologies
– Is Big Data <=> Hadoop?
8. Simple to start
• What is the maximum file size you have dealt with so far?
– Movies/Files/Streaming video that you have used?
– What have you observed?
• What is the maximum download speed you get?
• Simple computation
– How much time does it take just to transfer that file?
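As a rough sketch of the simple computation above: the ideal transfer time is the file size in bits divided by the link speed. The 4 GB file, the 1 TB dataset and the 100 Mbit/s link are assumed example values, not figures from the course, and protocol overhead is ignored.

```python
GB = 10**9    # bytes in a gigabyte (decimal)
TB = 10**12   # bytes in a terabyte (decimal)
MBPS = 10**6  # bits per second in one megabit per second


def transfer_time_seconds(size_bytes, bandwidth_bits_per_sec):
    """Ideal time to move size_bytes over a link, ignoring overhead."""
    return size_bytes * 8 / bandwidth_bits_per_sec


# Assumed example values: a 4 GB movie and a 1 TB dataset on a 100 Mbit/s link.
movie_seconds = transfer_time_seconds(4 * GB, 100 * MBPS)    # 320 seconds
dataset_seconds = transfer_time_seconds(1 * TB, 100 * MBPS)  # 80,000 seconds

print(f"4 GB movie:   {movie_seconds / 60:.1f} minutes")
print(f"1 TB dataset: {dataset_seconds / 3600:.1f} hours")
```

Even before any processing starts, moving a terabyte over such a link takes the better part of a day, which is one reason to bring the computation to the data.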
9. What is big data?
• “Every day, we create 2.5 quintillion bytes of data — so
much that 90% of the data in the world today has been
created in the last two years alone. This data comes
from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures
and videos, purchase transaction records, and cell
phone GPS signals to name a few.
This data is ‘big data.’”
10. Huge amount of data
• There are huge volumes of data in the world:
+ From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data.
+ In 2011, the same amount was created every two days.
+ In 2013, the same amount was created every 10 minutes.
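To get a feel for these figures, the sketch below converts them into average creation rates. It assumes decimal units (1 exabyte = 10^18 bytes) and takes the slide's numbers at face value; the function and variable names are illustrative only.

```python
EXABYTE = 10**18                 # bytes, decimal convention (assumption)
BASELINE_BYTES = 5 * EXABYTE     # the "5 billion gigabytes" quoted above


def bytes_per_second(total_bytes, period_seconds):
    """Average creation rate if total_bytes appear every period_seconds."""
    return total_bytes / period_seconds


rate_2011 = bytes_per_second(BASELINE_BYTES, 2 * 24 * 3600)  # every two days
rate_2013 = bytes_per_second(BASELINE_BYTES, 10 * 60)        # every 10 minutes

print(f"2011: about {rate_2011 / 10**12:.1f} TB per second")
print(f"2013: about {rate_2013 / 10**15:.2f} PB per second")
print(f"Growth factor 2011 -> 2013: {rate_2013 / rate_2011:.0f}x")
```

On these numbers the rate grows by a factor of 288 in two years (two days is 2,880 minutes, against 10 minutes).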
11. Big data spans three dimensions: Volume, Velocity and Variety
• Volume: Enterprises are awash with ever-growing data of all types, easily amassing
terabytes, even petabytes, of information.
– Turn 12 terabytes of Tweets created each day into improved product sentiment
analysis
– Convert 350 billion annual meter readings to better predict power consumption
• Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching
fraud, big data must be used as it streams into your enterprise in order to maximize its
value.
– Scrutinize 5 million trade events created each day to identify potential fraud
– Analyze 500 million daily call detail records in real-time to predict customer churn
faster
– The latest I have heard is that even a 10-nanosecond delay is too much.
• Variety: Big data is any type of data - structured and unstructured data such as text,
sensor data, audio, video, click streams, log files and more. New insights are found
when analyzing these data types together.
– Monitor hundreds of live video feeds from surveillance cameras to target points of
interest
– Exploit the 80% data growth in images, video and documents to improve customer
satisfaction
12. Finally…
‘Big data’ is similar to ‘small data’, but bigger
… but bigger data requires different
approaches:
Techniques, tools, architecture
… with an aim to solve new problems
or old problems in a better way
13. To whom does it matter?
• Research Community
• Business Community - New tools, new
capabilities, new infrastructure, new business
models, etc.
• Sectors
– Financial Services..
15. The Social Layer in an Instrumented Interconnected World
• 2+ billion people on the Web by end of 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009… 200M by 2014
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
17. BIG DATA is not just HADOOP
• Manage & store huge volume of any data
– Hadoop File System, MapReduce
• Manage streaming data
– Stream Computing
• Analyze unstructured data
– Text Analytics Engine
• Structure and control data
– Data Warehousing
• Integrate and govern all data sources
– Integration, Data Quality, Security, Lifecycle Management, MDM
• Understand and navigate federated big data sources
– Federated Discovery and Navigation
18. Types of tools typically used in a Big Data scenario
• Where is the processing hosted?
– Distributed server/cloud
• Where is the data stored?
– Distributed storage (e.g., Amazon S3)
• What is the programming model?
– Distributed processing (MapReduce)
• How is the data stored and indexed?
– High-performance, schema-free database
• What operations are performed on the data?
– Analytic/semantic processing (e.g., RDF/OWL)
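To make the distributed programming model above concrete, here is a minimal single-machine sketch of the MapReduce idea using word counting. It is plain Python, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


print(word_count(["big data", "big deal"]))
```

Because each map call sees only one document and each reduce call sees only one key, both steps can be spread over many workers without sharing state.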
19. When dealing with Big Data is hard
• When the operations on data are complex:
– E.g., simple counting is not a complex problem.
– Modeling and reasoning with data of different kinds can get
extremely complex
• Good news with big-data:
– Often, because of the vast amount of data, modeling
techniques can get simpler (e.g., smart counting can
replace complex model-based analytics)…
– …as long as we deal with the scale.
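One hypothetical illustration of counting standing in for a model: predicting the next word purely from how often word pairs occur in a corpus. The tiny corpus and the function names below are assumptions made for this sketch; the point is only that, with enough data, plain frequency counts can give usable predictions.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of word, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus, assumed for illustration only.
corpus = ["big data is big", "big data is hard", "big data matters"]
counts = train_bigram_counts(corpus)
print(predict_next(counts, "big"))
print(predict_next(counts, "data"))
```

No grammar or language model is involved; the quality of the prediction depends entirely on how much data has been counted.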
20. Time for thinking
• What do you do with the data?
– Let's take an example:
• “From application developers to video streamers, organizations of all
sizes face the challenge of capturing, searching, analyzing, and
leveraging as much as terabytes of data per second—too much for the
constraints of traditional system capabilities and database
management tools.”
21. Why Big-Data?
• Key enablers for the appearance and growth
of ‘Big-Data’ are:
+Increase in storage capabilities
+Increase in processing power
+Availability of data