Ajay Agarwal
ROLL NO :5201
How much time did it take?
· Excel: Have you ever tried a pivot table on a 500 MB file?
· SAS/R: Have you ever tried a frequency table on a 2 GB file?
· Access: Have you ever tried running a query on a 10 GB file?
· SQL: Have you ever tried running a query on a 50 GB file?
Can you think of this?
· Can you think of running a query on a 20,980,000 GB file?
· What if we get a new data set like this every day?
· What if we need to execute complex queries on this data set
every day?
· Does anybody really deal with this type of data set?
· Is it possible to store and analyze this data?
· Yes. Google deals with more than 20 PB of data every day.
In fact, in a minute:
· Email users send more than 204 million messages;
· The mobile Web receives 217 new users;
· Google receives over 2 million search queries;
· YouTube users upload 48 hours of new video;
· Facebook users share 684,000 pieces of content;
· Twitter users send more than 100,000 tweets;
· Consumers spend $272,000 on Web shopping;
· Apple receives around 47,000 application downloads;
· Brands receive more than 34,000 Facebook 'likes';
· Tumblr blog owners publish 27,000 new posts;
· Instagram users share 3,600 new photos;
· Flickr users, on the other hand, add 3,125 new photos;
· Foursquare users perform 2,000 check-ins;
· WordPress users publish close to 350 new blog posts.
And these figures are already a year old. Damn!
· A collection of data sets so large and complex that it
becomes difficult to process using on-hand database
management tools or traditional data processing
applications.
· "Big Data" is data whose scale, diversity, and
complexity require new architectures, techniques,
algorithms, and analytics to manage it and to extract value
and hidden knowledge from it.
· 'Big Data' is similar to 'small data', but bigger in size.
· It aims to solve new problems, or old problems in a better
way.
· Big Data generates value from the storage and processing
of very large quantities of digital information that cannot be
analyzed with traditional computing techniques.
· Volume: data quantity
· Velocity: data speed
· Variety: data types
· A typical PC might have had 10 gigabytes of storage in
2000.
· Today, Facebook ingests 500 terabytes of new data every
day.
· A Boeing 737 will generate 240 terabytes of flight data during
a single flight across the US.
· Smartphones, the data they create and consume, and
sensors embedded into everyday objects will soon result in
billions of new, constantly updated data feeds containing
environmental, location, and other information, including
video.
· Clickstreams and ad impressions capture user
behavior at millions of events per second.
· High-frequency stock trading algorithms reflect market
changes within microseconds.
· Machine-to-machine processes exchange data
between billions of devices.
· Infrastructure and sensors generate massive log data
in real time.
· Online gaming systems support millions of concurrent
users, each producing multiple inputs per second.
· Big Data isn't just numbers, dates, and strings.
Big Data is also geospatial data, 3D data, audio
and video, and unstructured text, including log
files and social media.
· Traditional database systems were designed to
address smaller volumes of structured data,
fewer updates, and a predictable, consistent data
structure.
· Big Data analysis has to cover all of these different
types of data.
Handling big data: Parallel computing
· Imagine a 1 GB text file containing all the status updates on Facebook in a day.
· Now suppose that simply counting the number of rows takes
10 minutes:
· SELECT COUNT(*) FROM fb_status
· What do you do if you have 6 months of data, a file of size 200 GB, and
you still want the result in 10 minutes?
· Parallel computing?
· Put multiple CPUs in a machine (100?)
· Write code that calculates 200 parallel counts and finally
sums them up
· But then you need a supercomputer
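The "200 parallel counts, then sum" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list `fb_status` is a made-up stand-in for the table, the helper names are hypothetical, and a thread pool stands in for the many CPUs (in CPython, threads would not actually speed up this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor


def count_rows(chunk):
    """Count the rows in one chunk (the per-worker partial count)."""
    return len(chunk)


def split_into_chunks(rows, n_chunks):
    """Split rows into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def parallel_count(rows, n_chunks=200, max_workers=8):
    """Count rows by summing partial counts computed concurrently."""
    chunks = split_into_chunks(rows, n_chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = list(pool.map(count_rows, chunks))
    return sum(partial_counts)


# Hypothetical stand-in for fb_status: one string per status update.
fb_status = [f"status update {i}" for i in range(1000)]
total = parallel_count(fb_status)
```

Each chunk is counted independently, so the only shared step is the final sum. That is what makes the job easy to spread over many CPUs or machines.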
Handling big data: Is there a
better way?
· Until about 1985, there was no practical way to connect multiple computers.
All systems were centralized systems.
· So multi-core systems or supercomputers were the only options
for big data problems.
· After 1985, powerful microprocessors and high-speed
computer networks (LANs, WANs) became available, which led to distributed
systems.
· Now that a distributed system can make a
collection of independent computers appear to its users as a
single coherent system, can we use some cheap computers
to process our big data quickly?
MapReduce Programming Model
· Data is processed using special map() and reduce() functions.
· The map() function is called on every item in the input and
emits a series of intermediate key/value pairs (local
calculation).
· All values associated with a given key are grouped together.
· The reduce() function is called on every unique key and its
value list, and emits a value that is added to the output (final
organization).
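The three steps above (map, group by key, reduce) can be sketched as a single-machine word count. This is a toy illustration, not Hadoop's API: the names `map_fn`, `shuffle`, `reduce_fn` and the sample `statuses` are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(line):
    """map(): emit an intermediate (word, 1) pair for every word in one item."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group together all values associated with the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """reduce(): collapse one key's value list into a single output value."""
    return key, sum(values)


def map_reduce(items):
    """Run map on every item, group by key, then reduce every unique key."""
    intermediate = [pair for item in items for pair in map_fn(item)]
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Made-up input: two short status updates.
statuses = ["big data is big", "data is everywhere"]
counts = map_reduce(statuses)
```

On a real cluster the map calls run on many machines at once and the grouping step moves data over the network. The per-item and per-key functions keep the same shape.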
Not just MapReduce
· Earlier, count = count + 1 was sufficient, but now we need to:
1. Set up a cluster of machines, then divide the whole data set into
blocks and store them on local machines
2. Assign a master node that takes charge of all metadata, work
scheduling and distribution, and job orchestration
3. Assign worker slots to execute map or reduce functions
4. Balance load (what if one machine in the cluster is very slow?)
5. Tolerate faults (what if the intermediate data is partially read,
but the machine fails before all reduce (collation) operations
can complete?)
6. Finally, write the MapReduce code that solves our problem
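To make the bookkeeping in steps 1 to 5 concrete, here is a toy master loop. It is a sketch under heavy assumptions: worker names, `failed_workers`, and the round-robin policy are all hypothetical, failures are simulated with a set rather than detected, and everything runs in one process.

```python
def split_into_blocks(records, block_size):
    """Step 1: divide the whole data set into fixed-size blocks."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def run_job(blocks, workers, task, failed_workers=frozenset()):
    """Steps 2-5: a toy master that assigns each block to a worker.

    A block whose first-choice worker is in failed_workers (simulated
    failure) is rescheduled on a healthy worker instead.
    Returns (results, assignment).
    """
    healthy = [w for w in workers if w not in failed_workers]
    if not healthy:
        raise RuntimeError("no healthy workers left")
    results, assignment = [], {}
    for block_id, block in enumerate(blocks):
        worker = workers[block_id % len(workers)]  # round-robin first choice
        if worker in failed_workers:
            worker = healthy[block_id % len(healthy)]  # reschedule the task
        assignment[block_id] = worker
        results.append(task(block))  # the task itself runs locally in this sketch
    return results, assignment


records = list(range(10))
blocks = split_into_blocks(records, 3)
partial_counts, assignment = run_job(
    blocks, ["w0", "w1", "w2"], len, failed_workers={"w1"}
)
total = sum(partial_counts)
```

Even in this toy form, most of the code is scheduling and failure handling rather than the counting itself, which is the point of the list above.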
· OK. Analysis of big data can give us awesome insights.
· But the data sets are huge, complex, and difficult to process.
· I found a solution: distributed computing, or MapReduce.
· But it looks like this data storage and parallel processing
is complicated.
· What is the solution?
Hadoop
· Hadoop is a collection of tools with many components. HDFS
and MapReduce are its two core components.
· HDFS: Hadoop Distributed File System
  · Makes it easy to store data on commodity hardware
  · Built to expect hardware failures
  · Intended for large files and batch inserts
· MapReduce
  · For parallel processing
· So Hadoop is a software platform that lets one easily write
and run applications that process big data.
Why Hadoop is useful
· Scalable: It can reliably store and process petabytes.
· Economical: It distributes the data and processing across
clusters of commonly available computers (numbering in the thousands).
· Efficient: By distributing the data, it can process it in parallel
on the nodes where the data is located.
· Reliable: It automatically maintains multiple copies of data
and automatically redeploys computing tasks when failures
occur.
· And Hadoop is free.
So what is Hadoop?
· Hadoop is not big data
· Hadoop is not a database
· Hadoop is a platform/framework
  · It allows the user to quickly write and test distributed
systems
  · It automatically and efficiently distributes the data
and work across machines
[Figure: Hadoop ecosystem]
[Figure: Big Data ecosystem]
· Examining large amounts of data
· Appropriate information
· Identification of hidden patterns and unknown correlations
· Competitive advantage
· Better business decisions: strategic and operational
· Effective marketing, customer satisfaction, increased
revenue
· Where is processing hosted?
  · Distributed servers / cloud (e.g. Amazon EC2)
· Where is data stored?
  · Distributed storage (e.g. Amazon S3)
· What is the programming model?
  · Distributed processing (e.g. MapReduce)
· How is data stored and indexed?
  · High-performance schema-free databases (e.g. MongoDB)
· What operations are performed on data?
  · Analytic / semantic processing
Applications of Big Data analytics
· Smarter Healthcare
· Homeland Security
· Multi-channel sales
· Telecom
· Traffic Control
· Manufacturing
· Trading Analytics
· Search Quality
· Organizations can be overwhelmed by the data
· Need the right people, solving the right problems
· Costs escalate too fast
· It isn't necessary to capture 100% of the data
· Many sources of big data raise privacy concerns
  · Self-regulation
  · Legal regulation
· Our newest research finds that organizations are using big
data to target customer-centric outcomes, tap into internal
data, and build a better information ecosystem.
· Big Data is already an important part of the $64 billion
database and data analytics market.
· It offers commercial opportunities of a comparable
scale to enterprise software in the late 1980s,
the Internet boom of the 1990s, and the social media
explosion of today.
The End
Ajay Agarwal
Roll no : 5201
