Apache Hadoop - Big Data Management

Presented at AITP Twin City Chapter on March 21st, 2013

http://www.aitp.org/members/group_content_view.asp?group=75779&id=184703

http://www.pantagraph.com/calendar/business/aitp-twin-city-chapter-march-meeting-big-data-management-with/event_d149f1d2-8507-11e2-98bb-3cd92bf14f20.html

Transcript

  • 1. Big Data Management on Apache Hadoop - Naresh Chintalcheru
  • 2. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Frameworks: MapReduce and YARN ■ Big Integration: Hadoop & BI Tools (SAP Business Objects, IBM Cognos) ■ Future of Hadoop: Batch to Real-time
  • 3. What is Big Data? Big Data is a collection of data sets so large and complex that it becomes difficult to process using traditional database management tools. -Wikipedia
  • 4. What is Big Data? Big Data is a collection of data sets so large and complex that it becomes difficult to process using traditional database management tools. ● Large data sets, on the order of terabytes and petabytes ● Complex, with different data types and formats ● Difficult to process with traditional database tools, typically requiring expensive and proprietary solutions
  • 5. What is Big Data? Is Big Data all about size?
  • 6. Big Data V-V-V-V Big data is explained using 4 V's ● Volume ● Velocity ● Variety ● Variability
  • 7. Big Volume Data usage over the years ... ● 3.5-inch floppy disk: 1.44MB max capacity ● CD: 700MB max capacity (music) ● DVD: capacities up to roughly 10GB (movies) ● Blu-ray Disc: 25GB (HD, 3D movies) ● iPod Classic: 160GB ● 3TB hard drive for $130 on amazon.com
  • 8. Big Volume Imagine your own personal life ... ● A couple of decades ago: postal mail from friends, household bills and printed family pictures ● Most of those communications have been replaced by Facebook messages, tweets, SMS texts and emails (fading away) ● Pictures are uploaded to Facebook, Flickr or Picasa ● How many bills do you pay online? You can look up online how much you paid for the same service last year
  • 9. Big Velocity Exponential growth of corporate & personal data ● Personal data ○ More music, more movies and more online transactions ● Facebook processed (infoq.com) ○ 2PB of data in 2009 ○ 20PB of data in 2010 ○ 60PB of data in 2011 ○ 100PB of data in 2012 ● Every sixty seconds ... (dzone.com) ○ 694,445 Google searches ○ 6,600+ pictures uploaded to Flickr ○ 98,000 tweets ○ 600 videos uploaded to YouTube ○ 13,000 iPhone apps downloaded
  • 10. Big Variety The variety of data can be just as challenging, because combinations of relational data, unstructured data such as text, images and video, and every other variation add complexity to storing, processing and querying the data. Traditional data: text data, emails, documents, stock records, finances, personal files. Big Data: pictures and images, audio and video, 3D models, location and sensor data.
  • 11. Big Variability Data is continuously changing ... ● It took years for traditional RDBMSs to add an XML column type ● There is still no JSON column type in the RDBMS ● Many more new formats are yet to come Dealing with variability in traditional databases is a very slow process
  • 12. Problem with RDBMS ● An RDBMS or traditional database deals with structured data ● 20% of corporate data is structured and 80% is unstructured ● A predefined database schema and data types make it harder to adapt to new data formats ● Horizontal scaling of an RDBMS is complex and expensive
  • 13. Power of Big Data Big Data ● Deals with unstructured data ● Built on a horizontally scaling architecture
  • 14. Big Data Sources Data collected from ... weblogs, social networks, video archives, photography archives, mobile phone data, sensors, RFID barcodes, medical records, atmospheric science, personal finance, camera surveillance, e-commerce and m-commerce transactions
  • 15. Big Data Benefits Create new revenue streams: the insights you gain from analyzing your market and its consumers with Big Data can open new lines of business. Perform effective risk analysis: predictive analytics, fueled by Big Data, lets you scan and analyze newspaper reports or social media feeds so that you permanently keep up to speed on the latest developments in your industry. Re-design products: Big Data can also help you understand how others perceive your products so that you can adapt them, or your marketing. Social intelligence: the emergence of Social Intelligence from social network websites, similar to Business Intelligence. Security benefits: web logs are saved and analysed for unusual access behaviours.
  • 16. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Frameworks: MapReduce and YARN ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 17. Big Inspiration Google released a series of papers on the technology behind its search product. ● Google released the first paper, on the distributed Google File System (GFS), in 2003. ● It released the second paper, on the MapReduce framework, in 2004. ● It released the next paper, on BigTable, in 2006.
  • 18. Big Inspiration Inspired by the Google papers ... Doug Cutting, a Yahoo employee at the time, saw the opportunity and led the charge of developing an open-source version of GFS and Google MapReduce. He named it Hadoop, after his kid's toy elephant.
  • 19. Big Inspiration Google technology → Apache Hadoop equivalent ● GFS (Google File System) → HDFS (Hadoop Distributed File System) ● GMR (Google MapReduce) → MapReduce ● BigTable → HBase ● Google Dremel → Apache Drill
  • 20. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Frameworks: MapReduce and YARN ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 21. Hadoop Architecture
  • 22. Hadoop Architecture Unlike traditional databases, Hadoop separates data processing and data storage into distinct layers, each distributed across the nodes of the cluster.
  • 23. Hadoop Architecture
  • 24. Hadoop Architecture What is Hadoop? A scalable, fault-tolerant grid operating system for data storage and processing. -Cloudera
  • 25. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Frameworks: MapReduce and YARN ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 26. HDFS HDFS: Hadoop Distributed File System ● Self-healing, high-bandwidth clustered storage ● Streams very large files across commodity servers ● Stores data as files ● Divides a single file into multiple blocks ● Fault-tolerant to hardware failures
  • 27. HDFS
  • 28. HDFS
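To make the file-oriented storage model concrete, here is a minimal sketch of writing and then streaming back a file through the HDFS Java API. The namenode URI, directory and file name are placeholders for illustration, not values from the presentation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; normally picked up from core-site.xml
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Write a small file; HDFS splits large files into blocks transparently
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello HDFS\n".getBytes("UTF-8"));
        }

        // Stream the file back
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```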
  • 29. HBase HBase Database ● Key/value data store ● A distributed, multi-dimensional sorted map ● Modeled after Google BigTable ● Not an RDBMS; only a light schema ● Random updates to the data are possible, unlike HDFS
  • 30. HBase Architecture
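As a rough illustration of the key/value model, the sketch below puts and gets a single cell through the HBase Java client of that era (the HTable-style API). The table name "users" and column family "info" are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table "users" with column family "info"
        HTable table = new HTable(conf, "users");

        // Write: row key -> column family:qualifier -> value
        Put put = new Put(Bytes.toBytes("user123"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("email"),
                Bytes.toBytes("user123@example.com"));
        table.put(put);

        // Random read by row key, something HDFS alone does not offer
        Result result = table.get(new Get(Bytes.toBytes("user123")));
        byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
        System.out.println(Bytes.toString(value));

        table.close();
    }
}
```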
  • 31. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Frameworks: MapReduce and YARN ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 32. MapReduce What is MapReduce? ● A programming model for processing large-scale data in parallel ● Automatic parallelization and distribution ● Two-phase processing: a Map phase and a Reduce phase ● A JobTracker coordinates the TaskTrackers ● Handles machine failures, just like HDFS
  • 33. MapReduce MapReduce Framework Map phase: extracts something you care about from each record; the framework then shuffles and sorts the intermediate records. Reduce phase: takes the output of the Map phase and aggregates, filters, transforms and summarizes the results.
  • 34. MapReduce Architecture
  • 35. MapReduce Architecture
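The canonical example of the two-phase model is word count. The sketch below shows only the Mapper and Reducer classes using the org.apache.hadoop.mapreduce API; the Job driver and the input and output paths are omitted for brevity.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in every input line
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: the framework has already shuffled and sorted by word,
// so each call receives one word and all of its counts to sum up
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```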
  • 36. YARN Framework What is YARN? ● Yet Another Resource Negotiator ● The next-generation MapReduce framework ● No single JobTracker controlling the TaskTrackers ● Each job controls its own destiny through an ApplicationMaster that takes care of the execution flow: scheduling tasks, handling speculative execution and failures, etc.
  • 37. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Frameworks: MapReduce and YARN ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 38. Hive What is Hive? Developed by Facebook engineers and donated to Apache, Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying and analysis. It operates on (optionally compressed) data stored in the Hadoop ecosystem.
  • 39. Hive ● A query layer for HDFS and HBase ● Provides an SQL-like language called HiveQL ● Hive queries are automatically converted into MapReduce jobs ● Queries can be accelerated with indexes ● Metadata is stored in an RDBMS, significantly reducing the time to perform semantic checks during query execution ● Facebook has one of the biggest Hive implementations
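Since HiveQL is typically reached over JDBC, the same path the BI tools mentioned later use, here is a minimal Java sketch of running a HiveQL query against HiveServer2. The host name, credentials and the weblogs table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; requires the hive-jdbc jar on the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder host; 10000 is the usual HiveServer2 default port
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL but is compiled into MapReduce jobs
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits "
                + "FROM weblogs GROUP BY page ORDER BY hits DESC LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```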
  • 40. Apache Pig ● Developed at Yahoo, Pig is a script-based query platform for HDFS and HBase ● The language for this platform is called Pig Latin ● Pig Latin scripts are automatically converted into MapReduce jobs, giving an ad-hoc way of creating and executing MapReduce jobs ● Differences between Pig and SQL include Pig's use of lazy evaluation, its ability to store data at any point during a pipeline, and explicit declaration of execution plans
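Pig Latin can also be driven from Java through Pig's PigServer API. The sketch below (with hypothetical paths and field names) registers a small pipeline that is only compiled into MapReduce jobs when the final store is requested, which illustrates the lazy evaluation noted above.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL runs against the local filesystem; MAPREDUCE targets the cluster
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registerQuery line is Pig Latin; the paths and field names are made up
        pig.registerQuery("logs = LOAD '/data/weblogs' USING PigStorage('\\t') "
                + "AS (ip:chararray, page:chararray, bytes:long);");
        pig.registerQuery("by_page = GROUP logs BY page;");
        pig.registerQuery("hits = FOREACH by_page GENERATE group AS page, COUNT(logs) AS n;");

        // Only at this point are the statements compiled and run as MapReduce jobs
        pig.store("hits", "/data/page_hits");
        pig.shutdown();
    }
}
```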
  • 41. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Frameworks: MapReduce and YARN ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 42. Apache Flume ● Hadoop can store and process all of the weblogs, network logs and sensor log data. ● But how does data that sits on many different servers get delivered to the Hadoop cluster? Apache Flume comes to the rescue.
  • 43. Apache Flume ● Flume is a distributed data collection service that takes flows of data from their sources and aggregates them to where they need to be processed. ● Its goals include reliability, scalability and extensibility.
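As a hedged sketch of how such a flow is wired together, the properties-style agent configuration below tails a local web server log and delivers the events to HDFS through a memory channel. The agent name, log path and namenode address are illustrative assumptions, not values from the presentation.

```properties
# Hypothetical single-agent Flume configuration
agent1.sources  = weblog-source
agent1.channels = mem-channel
agent1.sinks    = hdfs-sink

# Source: tail a web server log on the machine where the agent runs
agent1.sources.weblog-source.type     = exec
agent1.sources.weblog-source.command  = tail -F /var/log/httpd/access_log
agent1.sources.weblog-source.channels = mem-channel

# Channel: buffer events in memory between source and sink
agent1.channels.mem-channel.type     = memory
agent1.channels.mem-channel.capacity = 10000

# Sink: deliver the aggregated events into HDFS
agent1.sinks.hdfs-sink.type      = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/weblogs
agent1.sinks.hdfs-sink.channel   = mem-channel
```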
  • 44. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Frameworks: MapReduce and YARN ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 45. Integration with SAP Business Objects ● Business Objects v4.0 supports Apache Hadoop and Hive ● Business Objects accesses Hadoop using Hive as a data source ● Uses a JDBC driver to connect to Hadoop Hive http://events.asug.com/2012BOUC/1210_SAP_BusinessObjects_BI_4_0_FP3_on_Apache_Hadoop_Hive.pdf
  • 46. Integration with IBM Cognos ● IBM offers Hadoop support through a product named IBM InfoSphere BigInsights ● It adds a web-based analytical tool called BigSheets ● InfoSphere BigInsights has full integration with the Cognos reporting tool http://www-304.ibm.com/easyaccess/fileserve?contentid=217007
  • 47. Agenda ■ What is Big Data ■ Big Inspiration: Google GFS & BigTable ■ Hadoop's Big Architecture ■ Big Files: Apache HDFS and HBase ■ Big Frameworks: MapReduce and YARN ■ Big Queries: Hive and Pig Latin ■ Big Pipes: Flume and Sqoop ■ Big Integration: Hadoop & BI Tools (Business Objects, Cognos) ■ Future of Hadoop: Batch to Real-time
  • 48. Future of Hadoop ● Big Data is here to stay, and companies are going to lose in a big way if they don't seize the data science opportunity. ● We may see a new enterprise role called the Data Scientist. ● Apache Hadoop is a cutting-edge data technology, and all of the current frameworks & tools will change drastically.
  • 49. Batch to Real-time ● Problem with Hadoop ○ Hadoop jobs are by nature batch processes with high latency. ● Google Dremel ○ Google released another paper, on the Dremel project, which covers real-time processing of Big Data. ○ The open-source community started Apache Drill, which aims to bring Dremel-like real-time processing to the Hadoop ecosystem.
  • 50. References Yahoo tutorial - http://developer.yahoo.com/hadoop/tutorial/ Apache Hadoop - tutorial
  • 51. Thank you!