In the next 3-5 years, Hadoop will play a major role in many organisations, as data is growing in an uncontrolled way. This growth is characterised by the 3 V's of big data: velocity, variety, and volume.
Hadoop solves the big data problem with the Hadoop Distributed File System (HDFS), which provides the storage required for both structured and unstructured data, and with the MapReduce framework, which addresses the heavy processing requirements.
2. Content
Introduction to big data.
Data sources.
What is Hadoop?
Why Hadoop?
How Hadoop works.
MapReduce algorithm.
Problems.
Conclusion.
3. Introduction to big data
Doug Cutting and Mike Cafarella were involved in a project called "Nutch".
Big data is data that cannot be processed by traditional systems.
Problems faced by many organisations such as Google, IBM, Facebook, etc.
Explosive growth of data makes it difficult to make sense of.
3 V's: velocity, variety, volume.
4. Data sources
Facebook generates >25 TB daily.
Airbus generates >10 TB every 30 minutes.
Smartphones: >5 billion camera phones, many GPS-enabled.
Internet users: >2 billion people; Cisco estimates internet traffic at 8 ZB per year.
E-mail: 300 billion messages sent every day.
5. What is Hadoop?
Open-source software for storing and processing big data.
Distributed.
Framework.
Massive data storage.
Faster processing.
6. Why Hadoop?
Low cost – HDFS runs on commodity hardware.
Computing power.
Scalability.
Storage flexibility.
Inherent data protection and self-healing capabilities.
Handles large data, heavy computation, and unstructured data.
7. How Hadoop works
HDFS – a Java-based distributed file system that can store all kinds of data.
MapReduce – a software programming model for processing large data sets in parallel.
YARN – a resource manager for scheduling and handling resource requests from distributed applications.
Pig – a platform for manipulating data stored in HDFS.
Hive – a data warehouse.
ZooKeeper – an application that coordinates distributed processes.
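To make the HDFS idea above concrete, here is a minimal Python sketch of how a file is split into fixed-size blocks and each block is replicated across data nodes. This is an illustration only, not the real HDFS API; the block size, replication factor, node names, and round-robin placement are simplifying assumptions (real HDFS also considers rack topology).

```python
# Illustrative sketch of HDFS-style block placement (not the real HDFS API).
# Assumed parameters: 128 MB blocks, replication factor 3.
BLOCK_SIZE_MB = 128
REPLICATION = 3

def place_blocks(file_size_mb, datanodes):
    """Split a file into blocks and assign each block to REPLICATION nodes."""
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Round-robin placement across the available data nodes.
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(REPLICATION)
        ]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
layout = place_blocks(300, nodes)  # a 300 MB file -> 3 blocks
for block, replicas in layout.items():
    print(f"block {block}: {replicas}")
```

Replication is what gives HDFS its self-healing property: if a node dies, every block it held still has copies elsewhere, and the system can re-replicate from those.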
8. MapReduce algorithm
A large data set is split into smaller chunks and mapped to different computers; the intermediate results are grouped by theme (key) and reduced on a single computer to produce the output.
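The flow above can be sketched as a toy word count in plain Python. The map, shuffle, and reduce functions here are simplified stand-ins for illustration, not Hadoop's actual Java API:

```python
from collections import defaultdict

# Toy word count in the MapReduce style (simplified; not Hadoop's Java API).

def map_phase(chunk):
    """Map: emit (word, 1) pairs from one chunk of the input."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all values by key (the 'theme' step above)."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big hadoop", "hadoop big"]  # large data -> smaller chunks
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # -> {'big': 3, 'data': 1, 'hadoop': 2}
```

In real Hadoop, each chunk's map runs on a different machine in parallel, and the shuffle moves data across the network between the map and reduce stages.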
9. Problems
MapReduce is not suitable for iterative and interactive analytic tasks.
MapReduce is file-intensive – it creates multiple files.
Talent gap.
Fragmented data security issues.
Lack of tools for data quality and standardisation.
10. Conclusion
Select the right projects for Hadoop implementation.
Rethink and adapt the existing architecture to Hadoop.
Plan for the availability of skills and resources before starting.
Prepare to deliver trusted data for areas that impact business insight and operations.
Adopt lean and agile integration principles.
Gain an edge over the competition.