Big Data Management on Apache Hadoop
- Naresh Chintalcheru

Apache Hadoop - BigData Management


Presented at AITP Twin City Chapter on March 21st, 2013

http://www.aitp.org/members/group_content_view.asp?group=75779&id=184703

http://www.pantagraph.com/calendar/business/aitp-twin-city-chapter-march-meeting-big-data-management-with/event_d149f1d2-8507-11e2-98bb-3cd92bf14f20.html

Published in: Technology, Business

  1. Big Data Management on Apache Hadoop - Naresh Chintalcheru
  2. Agenda
     ■ What is Big Data
     ■ Big Inspiration: Google GFS & BigTable
     ■ Hadoop's Big Architecture
     ■ Big Files: Apache HDFS and HBase
     ■ Big Frameworks: MapReduce and YARN
     ■ Big Queries: Hive and Pig Latin
     ■ Big Pipes: Flume and Sqoop
     ■ Big Integration: Hadoop & BI Tools (SAP Business Objects, IBM Cognos)
     ■ Future of Hadoop: Batch to Real-time
  3. What is Big Data?
     "Big Data is a collection of data sets so large and complex that it becomes difficult to process using traditional database management tools." - Wikipedia
  4. What is Big Data?
     Big Data is a collection of data sets so large and complex that it becomes difficult to process using traditional database management tools.
     ● Large: data sets measured in terabytes and petabytes
     ● Complex: many different data types and formats
     ● Difficult to process with traditional database tools, which tend to involve expensive, proprietary solutions
  5. What is Big Data?
     Is Big Data all about size?
  6. Big Data V-V-V-V
     Big Data is explained using four V's:
     ● Volume
     ● Velocity
     ● Variety
     ● Variability
  7. Big Volume
     Data usage over the years ...
     ● 3 1/2 inch floppy disk: max capacity 1.44MB
     ● CD: max capacity 700MB (music)
     ● DVD: 4.7GB single-layer, 8.5GB dual-layer (movies)
     ● Blu-ray Disc: 25GB (HD and 3D movies)
     ● iPod Classic: 160GB
     ● 3TB hard drive for $130 on amazon.com
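To put the jump in scale in perspective, the capacities on this slide can be compared with simple arithmetic. The sketch below is illustrative only: it assumes decimal units (1 TB = 1,000,000 MB), and the helper name `how_many` is made up for this example.

```python
import math

# Capacities taken from the slide, in megabytes (decimal units assumed).
FLOPPY_MB = 1.44
CD_MB = 700
HARD_DRIVE_MB = 3 * 1_000_000  # the 3TB hard drive


def how_many(small_mb, large_mb):
    """Number of whole small media needed to hold the larger capacity."""
    return math.ceil(large_mb / small_mb)


floppies_per_drive = how_many(FLOPPY_MB, HARD_DRIVE_MB)
cds_per_drive = how_many(CD_MB, HARD_DRIVE_MB)
print(f"Floppies to fill the drive: {floppies_per_drive:,}")
print(f"CDs to fill the drive: {cds_per_drive:,}")
```

Under these assumptions a single 3TB drive holds the contents of roughly two million floppy disks.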
  8. Big Volume
     Imagine your own personal life ...
     ● A couple of decades ago: postal mail from friends, household bills and printed family pictures
     ● Most communication has been replaced by Facebook messages, Tweets, SMS texts and emails (themselves fading away)
     ● Pictures are uploaded to Facebook, Flickr or Picasa
     ● How many bills do you pay online? You can look up online how much you paid for the same service last year
  9. Big Velocity
     Exponential growth of corporate & personal data
     ● Personal data
       ○ More music, more movies and more online transactions
     ● Data processed by Facebook (infoq.com)
       ○ 2 PB in 2009
       ○ 20 PB in 2010
       ○ 60 PB in 2011
       ○ 100 PB in 2012
     ● Every sixty seconds ... (dzone.com)
       ○ 694,445 Google searches
       ○ 6,600+ pictures uploaded to Flickr
       ○ 98,000 tweets
       ○ 600 videos uploaded to YouTube
       ○ 13,000 iPhone apps downloaded
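The velocity figures are easier to reason about as rates. The following sketch simply reworks the numbers quoted on the slide; the function names are hypothetical and nothing here comes from the cited sources beyond those figures.

```python
# Petabytes processed by Facebook per year, as quoted on the slide.
facebook_pb = {2009: 2, 2010: 20, 2011: 60, 2012: 100}


def yearly_growth(series):
    """Growth factor of each year relative to the previous year."""
    years = sorted(series)
    return {year: series[year] / series[prev] for prev, year in zip(years, years[1:])}


def per_second(per_minute):
    """Convert an 'every sixty seconds' count into a per-second rate."""
    return per_minute / 60


growth = yearly_growth(facebook_pb)
overall = facebook_pb[2012] / facebook_pb[2009]
searches_per_second = per_second(694_445)
print(growth)
print(f"Overall growth 2009-2012: {overall:.0f}x")
print(f"Google searches per second: {searches_per_second:,.0f}")
```

The year-over-year factor shrinks over time, but the volume still grows fifty-fold in three years, which is the point the slide is making.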
  10. Big Variety
      Flavors of data can be just as shocking, because combinations of relational data and unstructured data such as text, images, video, and every other variation can cause complexity in storing, processing, and querying the data.

      Traditional Data        Big Data
      Text data               Pictures, images
      Emails, documents       Audio, video
      Stock records           3D models
      Finances                Location
      Personal files          Sensor data
  11. Big Variability
      Data is continuously changing ...
      ● It took years for traditional RDBMS products to add an XML column type
      ● JSON column types have been just as slow to arrive in RDBMS products
      ● Many more new formats are still to come
      Dealing with variability in traditional databases is a very slow process.
  12. Problem with RDBMS
      ● An RDBMS, or traditional database, deals with structured data
      ● By a commonly quoted estimate, about 20% of corporate data is structured and 80% is unstructured
      ● A predefined database schema and fixed data types make it harder to adapt to new data formats
      ● Horizontal scaling of an RDBMS is complex and expensive
  13. Power of Big Data
      Big Data
      ● Deals with unstructured data
      ● Is built on a horizontal scaling architecture
  14. Big Data Sources
      Data collected from ...
      ● Weblogs, social networks
      ● Video archives, photography archives
      ● Mobile phone data, sensors
      ● RFID barcodes
      ● Medical records
      ● Atmospheric science
      ● Personal finance
      ● Camera surveillance
      ● e-commerce and m-commerce transactions
  15. Big Data Benefits
      ● Create new revenue streams: the insights you gain from analyzing your market and its consumers with Big Data can open new revenue streams for the company.
      ● Perform effective risk analysis: predictive analytics, fueled by Big Data, lets you scan and analyze newspaper reports or social media feeds so that you stay up to speed on the latest developments in your industry.
      ● Re-design products: Big Data can help you understand how others perceive your products, so that you can adapt them or your marketing.
      ● Social intelligence: Social Intelligence, similar to Business Intelligence, is emerging from social network websites.
      ● Security benefits: web logs are saved and analysed for unusual access behaviour.
  17. Big Inspiration
      Google released a series of papers on the technology behind its Search product.
      ● The first paper, on the distributed file system GFS, was released in 2003.
      ● The second paper, on the MapReduce framework, followed in 2004.
      ● The next paper, on BigTable, was released in 2006.
  18. Big Inspiration
      Inspired by the Google papers ...
      Doug Cutting, who joined Yahoo in 2006, saw the opportunity and led the charge to develop an open source version of GFS and Google MapReduce. He named it Hadoop after his son's toy elephant.
  19. Big Inspiration

      Google Products               Apache Hadoop Products
      GFS: Google File System       HDFS: Hadoop Distributed File System
      GMR: Google MapReduce         MapReduce
      BigTable                      HBase
      Google Dremel                 Apache Drill
  21. Hadoop Architecture
  22. Hadoop Architecture
      Unlike traditional databases, Hadoop separates data processing (MapReduce) and data storage (HDFS) into distinct layers, each spread across the nodes of the cluster.
  23. Hadoop Architecture
  24. Hadoop Architecture
      What is Hadoop?
      "A scalable fault-tolerant grid operating system for data storage and processing." - Cloudera
  26. HDFS
      HDFS: Hadoop Distributed File System
      ● Self-healing, high-bandwidth clustered storage
      ● Streams very large files from commodity servers
      ● Stores data as files
      ● Divides a single file into multiple blocks
      ● Tolerates hardware failures
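The block and fault-tolerance bullets above can be sketched in a few lines. This is a toy model, not HDFS code: the 64 MB block size and replication factor of 3 are assumed here because they were the usual Hadoop 1.x defaults, the round-robin placement is a simplification (real HDFS uses a rack-aware policy), and all function and node names are hypothetical.

```python
BLOCK_SIZE_MB = 64   # assumed default block size
REPLICATION = 3      # assumed default replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block a file of the given size is divided into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def blocks_lost(placement, failed_nodes):
    """Blocks whose every replica sits on a failed node."""
    failed = set(failed_nodes)
    return [b for b, holders in placement.items() if all(n in failed for n in holders)]


blocks = split_into_blocks(200)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(blocks)
print(blocks_lost(placement, ["node1", "node2"]))
```

In this model, losing two of the four nodes loses no data, because every block still has at least one surviving replica from which it could be re-replicated.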
  27. HDFS
  28. HDFS
  29. HBase
      HBase Database
      ● Key/value data store
      ● Distributed, multi-dimensional sorted map
      ● Modeled after Google BigTable
      ● Not an RDBMS; uses a light schema
      ● Allows random updates to the data, unlike HDFS
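A "multi-dimensional sorted map" can be pictured as row key -> column -> timestamp -> value, with row keys kept in sorted order so that range scans are cheap. The class below is a minimal in-memory sketch of that idea only; it is not the HBase client API, and the class, method and column names are invented for illustration.

```python
import bisect


class TinySortedTable:
    """Toy model: sorted row keys, each holding versioned column values."""

    def __init__(self):
        self._row_keys = []  # kept sorted for range scans
        self._rows = {}      # row key -> {column: {timestamp: value}}

    def put(self, row, column, value, timestamp):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Row keys in [start_row, stop_row), in sorted order."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, stop_row)
        return self._row_keys[lo:hi]


table = TinySortedTable()
table.put("user3", "info:name", "Cy", timestamp=1)
table.put("user1", "info:name", "Ann", timestamp=1)
table.put("user2", "info:name", "Bo", timestamp=1)
table.put("user1", "info:name", "Anna", timestamp=2)  # a random update
print(table.get("user1", "info:name"))
print(table.scan("user1", "user3"))
```

The second `put` for `user1` illustrates the random-update bullet: a new version is written for an existing cell rather than rewriting a whole file.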
  30. HBase Architecture
  32. MapReduce
      What is MapReduce?
      ● Programming model for processing large-scale data in parallel
      ● Automatic parallelization and distribution
      ● Two-phase processing: Map phase and Reduce phase
      ● Coordinated by a Job Tracker and Task Trackers
      ● Handles machine failures, just as HDFS does
  33. MapReduce
      MapReduce Framework
      Map Phase: extracts something you care about from each record; the records are then shuffled and sorted.
      Reduce Phase: takes its input from the Map Phase, then aggregates, filters, transforms and summarizes the results.
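The two phases are easiest to see on the classic word-count problem. The sketch below runs everything in one process purely to show the data flow (map, shuffle and sort, reduce); it is not the Hadoop API, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Extract what we care about from one record: each word, with a count of 1."""
    for word in record.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group all values by key and order the keys, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Summarize all values seen for one key."""
    return key, sum(values)


def run_job(records):
    mapped = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


counts = run_job(["big data", "big files big pipes"])
print(counts)
```

On a real cluster the map calls run in parallel on many nodes and the shuffle moves data across the network, but the contract between the phases is the same.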
  34. MapReduce Architecture
  35. MapReduce Architecture
  36. YARN Framework
      What is YARN?
      ● Yet Another Resource Negotiator
      ● Next-generation MapReduce framework
      ● No Job Tracker controlling the Task Trackers
      ● Each job controls its own destiny through an Application Master, which takes care of the execution flow: scheduling tasks, handling speculative execution and failures, and so on
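The division of labour described above can be modelled roughly as follows. ResourceManager and ApplicationMaster are real YARN component names, but the classes, methods and retry policy below are a hypothetical simplification written for this example, not the YARN API. Speculative execution is left out.

```python
class ResourceManager:
    """Cluster-wide bookkeeping: only tracks how many containers are free."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Per-job controller: schedules its own tasks and retries its own failures."""

    def __init__(self, tasks, resource_manager, max_attempts=2):
        self.tasks = list(tasks)
        self.rm = resource_manager
        self.max_attempts = max_attempts

    def run(self, run_task):
        pending = list(self.tasks)
        attempts = {task: 0 for task in self.tasks}
        completed, failed = [], []
        while pending:
            granted = self.rm.allocate(len(pending))
            if granted == 0:
                raise RuntimeError("no containers available")
            batch, pending = pending[:granted], pending[granted:]
            for task in batch:
                attempts[task] += 1
                if run_task(task):
                    completed.append(task)
                elif attempts[task] < self.max_attempts:
                    pending.append(task)  # retry later
                else:
                    failed.append(task)
            self.rm.release(granted)
        return completed, failed


already_failed = set()


def flaky_task(task):
    """Hypothetical task runner: 't2' fails on its first attempt only."""
    if task == "t2" and task not in already_failed:
        already_failed.add(task)
        return False
    return True


rm = ResourceManager(total_containers=2)
completed, failed = ApplicationMaster(["t1", "t2", "t3"], rm).run(flaky_task)
print(completed, failed)
```

The resource manager never learns what a task is or why it failed. That knowledge lives in the per-job master, which is the design point the slide is making.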
  38. Hive
      What is Hive?
      Developed by Facebook engineers and donated to Apache.
      Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It can operate on compressed data stored in the Hadoop ecosystem.
  39. Hive
      ● Query language for HDFS and HBase
      ● Provides an SQL-like language called HiveQL
      ● Automatically converts Hive queries to MapReduce jobs
      ● Accelerates queries by providing indexes
      ● Stores metadata in an RDBMS, significantly reducing the time needed for semantic checks during query execution
      ● Facebook has the biggest Hive implementation
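To make the query-to-job conversion concrete, consider a query such as `SELECT page, COUNT(*) FROM weblogs WHERE status = 200 GROUP BY page` (the table and column names are hypothetical). The sketch below shows, by hand, the kind of map and reduce steps such a query corresponds to. It illustrates the idea only and does not reproduce the plan Hive itself would generate.

```python
from itertools import groupby
from operator import itemgetter


def map_rows(rows):
    """WHERE becomes a map-side filter; the GROUP BY column becomes the key."""
    for row in rows:
        if row["status"] == 200:
            yield row["page"], 1


def reduce_groups(pairs):
    """COUNT(*) becomes a sum over the values grouped under each key."""
    ordered = sorted(pairs, key=itemgetter(0))  # stands in for shuffle and sort
    return {
        page: sum(count for _, count in group)
        for page, group in groupby(ordered, key=itemgetter(0))
    }


weblogs = [
    {"page": "/home", "status": 200},
    {"page": "/home", "status": 404},
    {"page": "/cart", "status": 200},
    {"page": "/home", "status": 200},
]
page_counts = reduce_groups(map_rows(weblogs))
print(page_counts)
```

The appeal of HiveQL is that the analyst writes only the one-line query and never sees these steps.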
  40. Apache Pig
      ● Developed at Yahoo, Pig is a scripting-based query language for HDFS and HBase
      ● The language for this platform is called Pig Latin
      ● Pig Latin scripts are automatically converted to MapReduce jobs, giving an ad-hoc way of creating and executing MapReduce jobs
      ● Differences between Pig and SQL include Pig's use of lazy evaluation, its ability to store data at any point during a pipeline, and its explicit declaration of execution plans
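Lazy evaluation means that declaring pipeline steps does no work until a result is actually requested (in Pig Latin, by a STORE or DUMP statement). Python generators behave the same way, so they make a convenient analogy. The sketch below is that analogy only, with hypothetical function names, and is not Pig Latin.

```python
def load(lines, read_log):
    """Like LOAD: yields parsed rows, recording each read so laziness is visible."""
    for line in lines:
        read_log.append(line)
        yield line.split(",")


def filter_by(rows, predicate):
    """Like FILTER: declares a step, evaluates nothing yet."""
    return (row for row in rows if predicate(row))


def foreach(rows, transform):
    """Like FOREACH ... GENERATE: declares a per-row transformation."""
    return (transform(row) for row in rows)


def store(rows):
    """Like STORE: the point at which the whole pipeline finally executes."""
    return list(rows)


read_log = []
raw = load(["a,1", "b,5", "c,9"], read_log)
big = filter_by(raw, lambda row: int(row[1]) > 3)
names = foreach(big, lambda row: row[0].upper())

reads_before_store = len(read_log)  # still 0: nothing has been read
result = store(names)
print(reads_before_store, result, len(read_log))
```

Because the steps are only declared up front, the engine can inspect the whole pipeline before running it. This is what lets Pig plan the MapReduce jobs it needs.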
  42. Apache Flume
      ● Hadoop can store and process all the weblogs, network logs and sensor log data.
      ● But how does data stored on many different servers get to the Hadoop cluster?
      Apache Flume comes to the rescue.
  43. Apache Flume
      ● Flume is a distributed data collection service that takes flows of data from their sources and aggregates them where they have to be processed.
      ● Goals include reliability, scalability and extensibility.
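Flume moves events through three kinds of component: a source that receives data, a channel that buffers it, and a sink that delivers it onward. The sketch below imitates that flow in memory to show how logs from several servers end up aggregated in one place. The component names follow Flume's terminology, but the classes and functions are hypothetical and are not the Flume API.

```python
from collections import deque


class Channel:
    """Bounded buffer between sources and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        if len(self.events) >= self.capacity:
            return False  # channel full: the source must retry or drop
        self.events.append(event)
        return True

    def take(self, batch_size):
        batch = []
        while self.events and len(batch) < batch_size:
            batch.append(self.events.popleft())
        return batch


def source(log_lines, channel, host):
    """Wrap each log line in an event tagged with its origin; return how many were accepted."""
    return sum(channel.put({"host": host, "body": line}) for line in log_lines)


def sink(channel, destination, batch_size=2):
    """Drain the channel in batches into the destination (standing in for HDFS)."""
    while True:
        batch = channel.take(batch_size)
        if not batch:
            break
        destination.extend(batch)


channel = Channel(capacity=10)
source(["GET /home", "GET /cart"], channel, host="web1")
source(["GET /login"], channel, host="web2")

collected = []
sink(channel, collected)
print(collected)
```

The channel is what gives the pipeline its reliability: a slow or unavailable destination fills the buffer rather than losing events outright.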
  45. Integration with SAP Business Objects
      ● Business Objects v4.0 supports Apache Hadoop and Hive
      ● Business Objects accesses Hadoop using Hive as a data source
      ● It uses a JDBC driver to connect to Hive on Hadoop
      http://events.asug.com/2012BOUC/1210_SAP_BusinessObjects_BI_4_0_FP3_on_Apache_Hadoop_Hive.pdf
  46. Integration with IBM Cognos
      ● IBM offers Hadoop support in a product named IBM InfoSphere BigInsights
      ● It adds a web-based analytical tool called BigSheets
      ● InfoSphere BigInsights has full integration with the Cognos reporting tool
      http://www-304.ibm.com/easyaccess/fileserve?contentid=217007
  48. Future of Hadoop
      ● Big Data is here to stay, and companies are going to lose in a big way if they don't make use of the data science opportunity.
      ● We might see a new enterprise role called Data Scientist.
      ● Apache Hadoop is a cutting-edge data technology, and all the current frameworks and tools will change drastically.
  49. Batch to Real-time
      ● Problem with Hadoop
        ○ Hadoop jobs are batch processes by nature, with high latency.
      ● Google Dremel
        ○ Google released another paper, on the Dremel project, which describes near real-time processing of Big Data.
        ○ The open source community started Apache Drill, which will bring Dremel-like real-time processing to the Hadoop ecosystem.
  51. Thank you
