Big data analytics: Technology's bleeding edge
There can be data without information, but there cannot be information without data.
Companies without Big Data analytics are deaf and dumb, mere wanderers on the web.

Published in: Technology

  1. • Big Data refers to massive, often unstructured data that is beyond the processing capabilities of traditional data management tools.
     • Big Data can take up terabytes or petabytes of storage in diverse formats, including text, video, sound, and images.
     • Traditional relational database management systems cannot deal with such large masses of data.
     • Examples: user updates on Facebook, clicks across the internet.
  2. • Volume refers to the huge amount of data being generated every minute.
     • 90% of the data we have now was created in just the past 2 years.
     • IP traffic by 2015 is expected to be 4X what it is now.
     • 3 billion people are expected to be online by 2015.
  3. • Velocity refers to the speed at which new data is generated and moves around.
     • It includes real-time systems such as online banking.
     • These systems need low response times.
     • "In-memory analytics" technology is employed to deal with data in motion.
  4. • Variety refers to the various data types we can now use.
     • Earlier, the focus was on neat, structured data kept in the form of tables in an RDBMS.
     • 80% of the data available now is unstructured.
     • Data types are heterogeneous, ranging from text to video, audio, and pictures.
  5. Transform problems into possibilities
  6. • Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations, and other real-time insights.
     • Uses of big data analytics: Google Search recommendations, Satyamev Jayate, gene reading.
     • Data mining vs. big data analytics:
       • Data mining: the data must be neat and clean. An elaborate ETL process is required, so insights have to wait for the ETL cycle to complete.
       • Big data analytics: the data cannot be neat because it is unstructured. Big data analytics provides real-time insights.
  7. • Descriptive
     • Diagnostic
     • Predictive
     • Prescriptive
  8. • Relational databases failed to store and process Big Data.
     • As a result, a new class of big data technology has emerged and is being used in many big data analytics environments.
     • The technologies associated with big data analytics include:
       • Hadoop
       • MapReduce
       • NoSQL
  9. • Hadoop is an open-source framework.
     • It is a Java-based programming framework.
     • It handles the processing and storage of large data sets.
     • It runs in a distributed computing environment.
     • Components of Hadoop:
       • HDFS (Hadoop Distributed File System)
       • MapReduce
  10. • HDFS stores data in a distributed, scalable, and fault-tolerant way.
      • The NameNode holds metadata about the data stored on the DataNodes.
      • The DataNodes hold the actual data in the form of blocks, and they are capable of communicating.
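The division of labour between the NameNode and the DataNodes can be sketched with a toy in-memory model. This is not the HDFS API: the block size, the replication factor, and the names `put`, `get`, `namenode`, and `datanodes` are all assumptions made for illustration.

```python
# Toy model of the NameNode/DataNode split; not the real HDFS API.
BLOCK_SIZE = 4    # characters per block (tiny, for illustration only)
REPLICATION = 2   # assumed number of copies kept of each block


def put(name, data, datanodes, namenode):
    """Split data into blocks and store each block on REPLICATION DataNodes."""
    entries = []
    for offset in range(0, len(data), BLOCK_SIZE):
        index = offset // BLOCK_SIZE
        block_id = f"{name}#{index}"
        locations = [(index + r) % len(datanodes) for r in range(REPLICATION)]
        for node in locations:
            datanodes[node][block_id] = data[offset:offset + BLOCK_SIZE]
        entries.append((block_id, locations))
    namenode[name] = entries  # metadata only: block ids and where they live


def get(name, datanodes, namenode):
    """Reassemble a file from any DataNode that still holds each block."""
    parts = []
    for block_id, locations in namenode[name]:
        holder = next(n for n in locations if block_id in datanodes[n])
        parts.append(datanodes[holder][block_id])
    return "".join(parts)


datanodes = [{}, {}, {}]  # each dict stands in for one DataNode's disk
namenode = {}
put("log.txt", "hello big data", datanodes, namenode)
datanodes[0].clear()      # simulate one DataNode failing
recovered = get("log.txt", datanodes, namenode)
```

In this sketch the file is still readable after one node is wiped, because every block has a second copy on another node.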
  11. Hadoop vs. SQL
      • Hadoop: data is stored as compressed files across n commodity servers. SQL: data is stored in tables and columns with relations between them.
      • Hadoop: fault tolerant; if one node fails, the system still works. SQL: if any one node crashes, it raises an error so as to maintain consistency.
      • Any questions?
  12. • Copying the same data across many nodes (HDFS keeps three copies of each block by default, not one per node): doesn't that seem like a waste of space?
      • It is actually not wasted storage, for 2 reasons:
        • If one node fails, the system still works, because the data is never lost.
        • The query is spread over the nodes, so results come back faster thanks to parallel processing.
      • Example: select the count of the word 'happy' on Twitter. The query is split across multiple servers by some criterion (here, months), and the results are consolidated.
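The Twitter example on this slide can be sketched as follows. The sample tweets, the `tweets_by_month` partitioning, and the `count_word` helper are hypothetical, and a thread pool stands in for the separate servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data, already partitioned by month.
tweets_by_month = {
    "jan": ["so happy today", "rainy monday"],
    "feb": ["happy happy birthday", "new phone"],
    "mar": ["not happy about traffic"],
}


def count_word(tweets, word="happy"):
    """Count occurrences of `word` within one partition."""
    return sum(tweet.split().count(word) for tweet in tweets)


# Each partition is counted independently; threads stand in for servers.
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(count_word, tweets_by_month.values()))

# Consolidate the per-partition results into the final answer.
total = sum(partial_counts)
```

Because each month is counted independently, adding servers (here, workers) shortens the wait without changing the final sum.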
  13. • MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. In the previous example, the Twitter data was processed on different servers on the basis of months.
      • Hadoop provides an implementation of MapReduce.
      • A job is a combination of 2 Java functions: Mapper() and Reducer().
      • Example: using word count to check the popularity of a word in text.
  14. • The mapper function processes the split files and provides input to the reducer:
          Mapper(filename, file-contents):
              for each word in file-contents:
                  emit(word, 1)
      • The reducer function combines the input provided by the mapper and produces the output:
          Reducer(word, values):
              sum = 0
              for each value in values:
                  sum = sum + value
              emit(word, sum)
      • Can anyone think of any disadvantages?
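The Mapper and Reducer pseudocode above can be turned into a small single-process sketch. It is plain Python rather than Hadoop's Java API. The `run_word_count` driver and the sample files are assumptions, and lowercasing the words is an added choice. The grouping step plays the role of the shuffle that a MapReduce framework performs between the two phases.

```python
from collections import defaultdict


def mapper(filename, contents):
    """Emit (word, 1) for every word in one input split."""
    for word in contents.split():
        yield word.lower(), 1


def reducer(word, values):
    """Sum all the counts emitted for one word."""
    return word, sum(values)


def run_word_count(files):
    """Hypothetical driver: map, group by key (the shuffle), then reduce."""
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            grouped[word].append(count)
    return dict(reducer(word, values) for word, values in grouped.items())


# Hypothetical input splits.
files = {"a.txt": "big data is big", "b.txt": "data is the new oil"}
counts = run_word_count(files)
```

Each call to `mapper` depends only on its own split, and each call to `reducer` only on its own word, which is what lets a real cluster run them on different machines.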
  15. • There were 2 major disadvantages when Hadoop was developed, which have now been resolved:
      • HDFS depended on a single NameNode. Solution: a second NameNode is attached to the primary NameNode (in Hadoop 2 high availability, a standby NameNode takes over on failure; the older Secondary NameNode only checkpoints metadata).
      • MapReduce is a Java framework and did not support SQL queries. Solution: Facebook developed Hive, which allowed scientists to work with SQL on a distributed database.
  16. • NoSQL stands for "Not only SQL".
      • It is a non-relational database management system.
      • It is used where no fixed schemas are required and data is scaled horizontally.
      • 4 categories of NoSQL databases:
        • Key-value pair
        • Columnar database
        • Graph databases
        • Document databases
  17. KEY-VALUE PAIR
      • Keys are used to get values from opaque data blocks.
      • Works like a hash map.
      • Tremendously fast.
      • Drawback: no provision for content-based queries.
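A plain dictionary is enough to sketch both the speed and the drawback. The keys, the stored bytes, and the `put`/`get` helpers are hypothetical and do not belong to any particular key-value product.

```python
# A dict stands in for the key-value store; values are opaque bytes.
store = {}


def put(key, value):
    """Store an opaque value under a key."""
    store[key] = value


def get(key):
    """Fetch a value by key with a single hash lookup."""
    return store.get(key)


put("user:1", b'{"name": "Asha", "city": "Pune"}')
put("user:2", b'{"name": "Ravi", "city": "Delhi"}')

fast_lookup = get("user:2")

# The store knows nothing about what is inside a value, so a content-based
# question ("who lives in Pune?") means the client must scan every value.
pune_keys = [key for key, value in store.items() if b'"city": "Pune"' in value]
```

The lookup by key touches one entry, while the content-based question touches all of them, which is the drawback the slide names.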
  18. DOCUMENT DATABASE
      • Again a key-value store, but the value is in the form of a document.
      • Documents do not have fixed schemas.
      • Documents can be nested.
      • Queries can be based on content as well as on keys.
      • Use case: blogging websites.
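A minimal sketch of the same idea, using nested dictionaries as documents. The blog posts, the dotted-path syntax, and the `find` helper are hypothetical; real document databases offer their own query languages.

```python
# Hypothetical blog posts: same collection, different shapes, nested fields.
posts = {
    "post:1": {
        "title": "Hadoop basics",
        "author": {"name": "Asha"},
        "tags": ["hadoop", "hdfs"],
    },
    "post:2": {
        "title": "NoSQL tour",
        "author": {"name": "Ravi"},
        "comments": [{"by": "Asha", "text": "Nice"}],
    },
}


def find(collection, path, value):
    """Return keys of documents whose nested field at a dotted path equals value."""
    hits = []
    for key, doc in collection.items():
        node = doc
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if node is not None and node == value:
            hits.append(key)
    return hits


by_key = posts["post:2"]["title"]                  # lookup by key
by_content = find(posts, "author.name", "Asha")    # lookup by content
```

Unlike the opaque values of a key-value store, the database can see inside each document, so both kinds of lookup are possible even though the two posts have different fields.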
  19. COLUMNAR DATABASE
      • Works on attributes rather than tuples.
      • The key here is the column name, and the value is the contiguous run of that column's values.
      • Best for aggregation queries.
      • Typical query pattern: select (1 or 2 columns' values) where (the same or another column's value) = some value.
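The difference between storing tuples and storing attributes can be sketched with plain lists. The sales table and its column names are hypothetical.

```python
# Row-oriented view: one record (tuple) per entry.
rows = [
    {"city": "Pune", "year": 2013, "sales": 120},
    {"city": "Delhi", "year": 2013, "sales": 200},
    {"city": "Pune", "year": 2014, "sales": 150},
]

# Column-oriented view: column name -> contiguous list of that column's values.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Equivalent of: select sum(sales) where city = 'Pune'
# Only the two columns involved are read; "year" is never touched.
pune_sales = sum(
    sales
    for sales, city in zip(columns["sales"], columns["city"])
    if city == "Pune"
)
```

An aggregation over one or two columns reads only those lists, which is why this layout suits the query pattern on the slide.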
  20. GRAPH DATABASES
      • A graph database is a collection of nodes and edges.
      • Nodes represent data, while edges represent the links between them.
      • Most dynamic and flexible.
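A small adjacency-set sketch shows the node-and-edge model and the kind of question it answers well. The people, the friendships, and the `within` helper are hypothetical; graph databases provide their own query languages for such traversals.

```python
from collections import deque

# Hypothetical friendship edges between people (the nodes).
edges = [("asha", "ravi"), ("ravi", "meera"), ("meera", "john"), ("asha", "zoe")]

# Build an undirected adjacency map: node -> set of neighbouring nodes.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def within(graph, start, max_hops):
    """Breadth-first search: nodes reachable from start in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return {node for node, hops in distance.items() if hops > 0}


friends_of_friends = within(graph, "asha", 2)
```

Adding a new kind of relationship only means adding edges, with no schema change, which is the flexibility the slide refers to.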
  21. Websites:
      • http://searchbusinessanalytics.techtarget.com/ (Experts sound off on big data, analytics and its tools)
      • http://www.ibmbigdatahub.com/infographic/four-vs-big-data (Big data and analytics hub)
      • https://bigdatauniversity.com/bdu-wp/bdu-course/hadoop-fundamentals-i-version-3/ (Hadoop fundamentals)
      Research papers:
      • Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters". Appeared in OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
  22. Data is the new oil. Without Big Data analysis, companies are deaf and dumb, mere wanderers on the web... like cattle on the highway! Thank you! Keep dreaming BIG :D