Hadoop for Data Warehousing Professionals

Hadoop simplifies your job as a Data Warehousing professional. With Hadoop, you can manage any volume, variety and velocity of data reliably and in comparably less time. As a Data Warehousing professional, you already have troubleshooting and data-processing skills, and these skills are sufficient for you to become a proficient Hadoop user.

Key Questions Answered
What are Big Data and Hadoop?
What are the limitations of current Data Warehouse solutions?
How does Hadoop solve these problems?
What does a real-world Hadoop use case in Data Warehouse solutions look like?


Transcript

  • 1. Slide 1 Hadoop for Data Warehousing Professionals
    View the complete course at: www.edureka.in/hadoop
    Post your questions on Twitter to @edurekaIN: #askEdureka
  • 2. Slide 2 Objectives of this Session
     What is Big Data?
     Traditional Data Warehousing solutions
     Problems with traditional DWH solutions
     Data Warehousing and Hadoop
     Hadoop for Big Data
     Hadoop and MapReduce
     Why Hadoop for Big Data?
     Real-world use case: Sears Holding Corp.
     Where to use what
    For queries during the session and the class recording, post on Twitter @edurekaIN with #askEdureka, or on Facebook at /edurekaIN.
  • 3. Slide 3 Big Data
     Lots of data (terabytes or petabytes).
     Big Data is the term for a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
     The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
    [Word-cloud graphic: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile, Big Data]
  • 4. Slide 4 Unstructured Data is Exploding
     2,500 exabytes of new information created in 2012, with the internet as the primary driver.
     "The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes this year."
  • 5. Slide 5 Big Data Scenarios: Hospital Care
    Hospitals are analyzing medical data and patient records to predict which patients are likely to seek readmission within a few months of discharge. The hospital can then intervene in hopes of preventing another costly hospital stay.
    A medical diagnostics company analyzed millions of lines of data to develop the first non-intrusive test for predicting coronary artery disease. To do so, researchers at the company analyzed over 100 million gene samples to ultimately identify the 23 primary predictive genes for coronary artery disease.
  • 6. Slide 6 Big Data Scenarios: Amazon.com
    Amazon has an unrivalled bank of data on online consumer purchasing behaviour that it can mine from its 152 million customer accounts. Amazon also uses Big Data to monitor, track and secure the 1.5 billion items in its retail store that are lying around its 200 fulfilment centres around the world.
    Amazon stores its product catalogue data in S3, which can write, read and delete objects of up to 5 TB each. The catalogue stored in S3 receives more than 50 million updates a week, and every 30 minutes all data received is crunched and reported back to the different warehouses and the website.
    [Image: http://wp.streetwise.co/wp-content/uploads/2012/08/Amazon-Recommendations.png]
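    To make the S3 object model concrete, here is a minimal sketch using the AWS SDK for Java (v1); the bucket name, object key and file name are hypothetical, and credentials are assumed to come from the default provider chain.

        import java.io.File;

        import com.amazonaws.services.s3.AmazonS3;
        import com.amazonaws.services.s3.AmazonS3ClientBuilder;

        public class CatalogObjectLifecycle {
          public static void main(String[] args) {
            // Builds a client from default credentials and region settings.
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // Write: store a catalogue file as an object (hypothetical names).
            // Individual S3 objects can be up to 5 TB; very large uploads go
            // through multipart upload rather than a single PUT.
            s3.putObject("example-catalog-bucket", "catalog/items.json",
                         new File("items.json"));

            // Read: fetch the object body back as a string.
            String body = s3.getObjectAsString("example-catalog-bucket",
                                               "catalog/items.json");
            System.out.println(body.length() + " characters retrieved");

            // Delete: completes the write/read/delete lifecycle described above.
            s3.deleteObject("example-catalog-bucket", "catalog/items.json");
          }
        }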
  • 7. Slide 7 Big Data Scenarios: Netflix
    Netflix uses 1 petabyte of storage for its streaming videos. BitTorrent Sync has transferred over 30 petabytes of data since its pre-alpha release in January 2013. The 2009 movie Avatar is reported to have taken over 1 petabyte of local storage at Weta Digital for the rendering of its 3D CGI effects. One petabyte of average MP3-encoded songs (at mobile quality, roughly one megabyte per minute) would take 2,000 years to play.
    [Image: http://smhttp.23575.nexcesscdn.net/80ABE1/sbmedia/blog/wp-content/uploads/2013/03/netflix-in-asia.png]
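    As a quick back-of-envelope check on that last figure: 1 PB ≈ 10^9 MB, so at 1 MB per minute that is 10^9 minutes of audio, and 10^9 / (60 × 24 × 365) ≈ 1,900 years of continuous playback, which rounds to the 2,000 years quoted.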
  • 8. Slide 8 IBM's Definition – Big Data Characteristics
    IBM characterizes Big Data along three dimensions: Volume, Velocity and Variety. Typical sources include web logs, images, videos, audio and sensor data.
    http://www-01.ibm.com/software/data/bigdata/
  • 9. Slide 9 Traditional Data Warehousing Solutions
    [Diagram: data from source systems (CRM, OLTP, SCM, Legacy, ERP) passes through an Extract, Transform, Load batch process into the Data Warehouse, which feeds BI and Analytics.]
  • 10. Slide 10 Problems with Traditional DWH Solutions
     Increasing data volumes (transactions, OLTP, OLAP)
     New data sources and types (email and documents; social media and web logs; machine and device/scientific data)
     Need for real-time data processing
  • 11. Slide 11 Data Warehousing and Hadoop
    [Diagram: data sources such as sensor data and web logs reach Hadoop via file copy or streaming; Hadoop performs Extract and Transform, and the Data Warehouse handles Query and Present.]
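    As a minimal sketch of the "file copy" ingestion path in that diagram, the snippet below uses Hadoop's Java FileSystem API to copy a local log file into HDFS; the NameNode address and both paths are hypothetical.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class CopyLogsToHdfs {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; on a real cluster this usually
            // comes from core-site.xml rather than being set in code.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

            FileSystem fs = FileSystem.get(conf);

            // Copy a local web-log file into HDFS, where it can be transformed
            // before aggregates are loaded into the warehouse.
            fs.copyFromLocalFile(new Path("/var/log/web/access.log"),
                                 new Path("/data/raw/weblogs/access.log"));

            fs.close();
          }
        }

    The same copy can be done from the shell with hadoop fs -put; continuous "streaming" ingestion is typically handled by dedicated tools rather than hand-written copies.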
  • 12. Slide 12 Hadoop for Big Data
     Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
     It is an open-source data management framework with scale-out storage and distributed processing.
  • 13. Slide 13 Hadoop and MapReduce
    Hadoop is a system for large-scale data processing. It has two main components:
     HDFS – Hadoop Distributed File System (storage): highly fault-tolerant, offers high-throughput access to application data, suits applications with large data sets, and is natively redundant.
     MapReduce (processing): a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner. It splits a task across processors, passing data between the map and reduce phases as key-value pairs.
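    To make the key-value model concrete, here is the canonical WordCount job written against the org.apache.hadoop.mapreduce API, as a minimal sketch: the mapper emits (word, 1) pairs and the reducer sums them per word.

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

          // Map phase: emit (word, 1) for every word in the input split.
          public static class TokenizerMapper
              extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
              }
            }
          }

          // Reduce phase: sum all the 1s emitted for each distinct word.
          public static class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              result.set(sum);
              context.write(key, result);
            }
          }

          public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }

    Packaged into a jar, it runs with hadoop jar wordcount.jar WordCount followed by the HDFS input and output directories supplied by the user.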
  • 14. Slide 14 Hadoop vs. RDBMS
    Many enterprises are turning to Hadoop, especially for applications generating big data: web applications, social networks and scientific applications. Hadoop brings the MapReduce computing paradigm, in contrast to traditional database systems built on RDBMS concepts.
  • 15. Slide 15 Why Hadoop for Big Data?
    Hadoop offers:
     Scalability (petabytes of data, thousands of machines)
     Flexibility in accepting all data formats (no schema)
     An efficient and simple fault-tolerance mechanism
     Commodity, inexpensive hardware
    Databases, in contrast, offer performance (through heavy indexing, tuning and data-organization techniques) and features such as provenance tracking and annotation management.
  • 16. Slide 16 Real-world Use Case – Sears Holding Corp.
    Case study: Sears Holding Corporation
     Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process customer-activity and sales data.
     Sears incorporated Hadoop into its Data Warehousing solution.
     Insight into data can provide business advantage; some key early indicators can mean fortunes to a business; and more data allows more precise analysis.
    http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038
  • 17. Slide 17 Limitations of the Existing Data Analytics Architecture
    [Diagram: data flows from instrumentation and collection into a storage-only grid holding the original raw data, 90% of the ~2 PB of which is archived; an ETL compute grid loads the remaining meagre 10% into the RDBMS (aggregated data, mostly append), which serves BI reports and interactive apps.]
    The limitations: (1) the original high-fidelity raw data can't be explored; (2) moving data to compute doesn't scale; (3) premature data death.
  • 18. Slide 18 Solution: Data Warehousing with Hadoop
    Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions. Hadoop serves as both the storage and the compute grid, so the entire ~2 PB of data is available for processing and no data is archived. The benefits: (1) data exploration and advanced analytics; (2) scalable throughput for ETL and aggregation; (3) data kept alive forever.
  • 19. Slide 19 Scalability in Hadoop
    [Diagram: one master node coordinating many slave nodes; the cluster scales out by adding slaves.]
  • 20. Slide 20 Where to use what?
    Requirement                                               | Data Warehouse | Hadoop
    Low latency, interactive reports, and OLAP                | ✓              |
    ANSI 2003 SQL compliance is required                      | ✓              |
    Preprocessing or exploration of raw unstructured data     |                | ✓
    Online archive as an alternative to tape                  |                | ✓
    High-quality cleansed and consistent data                 | ✓              |
    100s to 1000s of concurrent users                         | ✓              |
    Discovering unknown relationships in the data             |                | ✓
    Parallel complex process logic                            |                | ✓
    CPU-intense analysis                                      |                | ✓
    System, user and data governance                          | ✓              |
    Many flexible programming languages running in parallel   |                | ✓
    Unrestricted, ungoverned sandbox explorations             |                | ✓
    Analysis of provisional data                              |                | ✓
    Extensive security and regulatory compliance              | ✓              |
    Real-time data loading and one-second tactical queries    | ✓              |
    Source: http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which
  • 21. Slide 21 Questions?
    Buy the complete course at: www.edureka.in/hadoop
    Twitter @edurekaIN, Facebook /edurekaIN; use #askEdureka for questions