Content presented at a talk on Aug. 29th. The purpose was to inform a fairly technical audience about the primary tenets of Big Data and the Hadoop stack, with a walk-through of Hadoop and parts of its ecosystem, e.g. Pig, Hive, HBase.
2. Big Data: An Overview
Big Data
- High volume
- High velocity
- High variety
- High veracity
- Information assets that require new forms of processing
- e.g. NoSQL, MapReduce, Machine Learning
Examples
Large Hadron Collider
150 million sensors deliver data 40 million times per second
data flow > 150 million petabytes (annual rate), or ~500 exabytes per day
Tipp24 (European lotteries)
Analyze billions of transactions and hundreds of customer attributes
Led to a 90% decrease in the time needed to build predictive models
4. Hadoop: Elephant in the Room
Apache Hadoop
- Open-source, Java-based software framework
- Distributed processing of large data sets
- Runs on clusters of commodity hardware
Hadoop’s Benefits (Historical context)
- Doesn’t rely on hardware (“Big Iron”) to provide HA
- Failures are expected and assumed
- The framework handles failures to provide an HA computing service
- “Scale Up v/s Scale Out”
Key Components
- Hadoop Distributed File System (HDFS™) – the file system
- Hadoop MapReduce – the programming model
- Hadoop (v2) YARN: the resource manager
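The MapReduce programming model listed above is easy to sketch without a cluster. Below is a minimal, pure-Python simulation of the three phases of a word-count job (map, shuffle, reduce); the function names are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(docs):
    # Map: emit a (word, 1) pair for every word in every input record
    for doc in docs:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key (Hadoop does this between phases)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data flows fast"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'flows': 1, 'fast': 1}
```

On a real cluster the map and reduce functions run in parallel on different nodes, and the framework performs the shuffle over the network.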
Year Activity
2002 Nutch started
2003 Google GFS white paper published
2004 Google MapReduce white paper published
2005 First MapReduce implementation in Nutch
2006 Hadoop becomes an Apache project
2008 Hadoop in Yahoo! production
2009 Wins the 1 TB sort contest (62 seconds)
7. Hadoop: FAQs
What is a Map-Reduce job and why do I care?
The data-processing paradigm in Hadoop
Batch-mode or in real-time
In Java or in a variety of other languages (see below)
Higher-level frameworks such as Pig, Hive, etc. help too
I don’t drink Java anymore – what do I do?
Hadoop is Java-based but …
Hadoop Streaming supports Python, Ruby, R, etc.
I/O-bound: no difference; CPU-bound: Java is faster
What is Hadoop 2 and how will it affect my big data needs? (See slide #14)
Much more scalable
Separates programming models from cluster & resource management
Under what scenarios should I not use Hadoop?
You need answers in a hurry
Your queries are complex and need extensive optimization
You require random, interactive access to data
You store sensitive data
You are replacing a data warehouse
What are the differences between Hadoop & a traditional database?
Hadoop is not a DB
ACID properties
Unstructured / mixture of data sources
SQL access
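On the Streaming FAQ above: Hadoop Streaming runs any executable that reads lines on stdin and writes tab-separated key/value lines on stdout, with a sort between the two stages. The sketch below simulates the `cat input | mapper.py | sort | reducer.py` pipeline in one Python file; the sample input is hypothetical and nothing here touches a real cluster:

```python
from itertools import groupby

def mapper(lines):
    # What mapper.py would print: one "word<TAB>1" line per word
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    # What reducer.py would print: Streaming guarantees its input is
    # sorted by key, so equal keys are adjacent and groupby() works
    for word, group in groupby(sorted_lines, key=lambda l: l.split("\t")[0]):
        total = sum(int(l.split("\t")[1]) for l in group)
        yield f"{word}\t{total}"

# Local stand-in for: cat input.txt | mapper.py | sort | reducer.py
lines = ["hello hadoop", "hello streaming"]
result = list(reducer(sorted(mapper(lines))))
print(result)  # ['hadoop\t1', 'hello\t2', 'streaming\t1']
```

The same two functions, wrapped to read `sys.stdin` and print to stdout, would run unchanged under `hadoop jar hadoop-streaming.jar` with any of the languages the FAQ mentions.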
8. Hadoop Stack: Snapshot
Technology | Domain | Description
HDFS | File Storage | Java-based file storage with reliable, scalable access
MapReduce | Programming Framework | Original framework for distributed processing of data
Hadoop YARN | Resource Mgmt | Next-generation framework supporting MR and non-MR models
Pig | ETL / Data Flow | High-level analysis of large data sets; generates MR jobs
Hive | SQL Interface | DW layer allowing data summarization and ad-hoc queries
HBase | Columnar NoSQL Storage | Column-oriented NoSQL data storage system
Sqoop | Data Exchange | Easy data import/export for Hadoop clusters
Zookeeper | Process Coordination | Highly available system for process coordination
Oozie | Workflow Scheduler | Helps manage complex DAG job workflows
Ambari | Cluster Monitoring | Installation, admin & monitoring for Hadoop clusters
Avro | Serialization | Serializes data in an efficient binary format; uses JSON
Spark | Real-time Data Processing | Powerful processing engine: speed, ease of use, and sophisticated analytics (using ML)
10. Data Science: The Scoop
What is Data Science, or a Data Scientist?
To understand data, to process it, to extract value from it, to visualize it, to communicate it
Single source v/s disparate sources
Mine data for insight to extract business/competitive value
What is Machine Learning then ?
The science of getting computers to act without being explicitly programmed.
Machine learning and statistics may be the stars, but DS orchestrates the whole show.
Practical Uses
Product Recommendation
Medical Diagnosis
Stock Trading
Face Detection
11. Demo: Let’s get dirty!
Hadoop running on Single-Node Pseudo Cluster (Linux VM)
Start Hadoop
HelloWorld Hadoop style
Run a MapReduce job (wordcount)
No Java here
Use Python scripts to run a MapReduce job
Lipstick on a Pig
Perform ETL on some stocks/dividend data
Give me Hive
Calculate Top Batter Scores
Can you feel the HBase
Dump sales data into HBase, then access it via Hive
Use AWS to show a ‘real’ cluster
Connect to AWS and startup the cluster
Demo performance using wordcount example
* All Demos, installation guide and references available @ GitHub
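The Pig demo above boils down to a GROUP BY plus an aggregate over stock/dividend records. The same group-and-aggregate can be shown in plain Python; the symbols and values below are made up to stand in for the demo's dataset:

```python
from collections import defaultdict

# Hypothetical (symbol, dividend) records standing in for the demo data
records = [
    ("IBM", 0.55), ("IBM", 0.65),
    ("AAPL", 0.30), ("AAPL", 0.47),
    ("GE", 0.19),
]

# Rough equivalent of the Pig Latin sketch:
#   grouped = GROUP divs BY symbol;
#   FOREACH grouped GENERATE group, MAX(divs.dividend);
by_symbol = defaultdict(list)
for symbol, dividend in records:
    by_symbol[symbol].append(dividend)

top_dividend = {sym: max(divs) for sym, divs in by_symbol.items()}
print(top_dividend)  # {'IBM': 0.65, 'AAPL': 0.47, 'GE': 0.19}
```

Pig's value is that the two-line script above scales to billions of records: it compiles to MapReduce jobs, so the grouping happens in the shuffle phase across the cluster.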
Introduce Hadoop, Map-Reduce and HDFS concepts.
Hadoop
Apache Hadoop is an open-source software framework allowing for distributed processing of large data sets across clusters of computers on commodity hardware.
USP
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
- Mike Cafarella and Doug Cutting estimated a system supporting a one-billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000.
- Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn’t scale to the billions of pages on the Web.
- Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in production at Google. GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes.
- In 2004, they set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).
- In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch.
- in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale (see the sidebar Hadoop at Yahoo!).
- This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.
- In May 2009, it was announced that a team at Yahoo! had used Hadoop to sort one terabyte in 62 seconds.
What is Hadoop 2 and how will it affect my big data needs? (See slide #14)
Much more scalable (3,500 -> ~10,000 nodes)
Abstraction between the programming models (MapReduce, Impala, etc.) and cluster & resource management
Under what scenarios should I not use Hadoop?
You need answers in a hurry – MR crunching can take hours, sometimes days
Your queries are complex and require extensive optimization – serious technical skill is needed to optimize them
You require random, interactive access to data – SQL-on-Hadoop is getting better but not yet comparable
You store sensitive data – Hadoop’s security capabilities are less than stellar
You are replacing a data warehouse – Hadoop is better used to pre-process raw data and hand it to a DW for analytic workloads
What are the differences between Hadoop & a traditional database?
Hadoop is not a DB; it is more like a file system (HDFS)
Traditional DBs have ACID properties, which Hadoop doesn’t support out of the box
Traditional DBs can handle unstructured data, but less efficiently; Hadoop shines with a mixture of data sources
Hadoop SQL access is an order of magnitude (or more) slower than traditional SQL
Hortonworks and Cloudera
Both offer the same basic service to their customers: enterprise-ready Hadoop with greater security and stability, plus training for companies unfamiliar with the technology. Many draw the dividing line at how the two approach data warehouses, suggesting Hortonworks wants to complement existing data-warehouse storage while Cloudera wants to do away with it altogether. Yet Cloudera’s suggested deployment for its Enterprise Data Hub does incorporate legacy warehouse storage. A greater distinction lies in the technologies the companies offer. Hortonworks is an open-source purist, using only technology open-sourced through the Apache Foundation; when you pay for Cloudera, you pay for a stack of proprietary and open-source components, including online NoSQL (HBase), analytic SQL (Impala), in-memory processing and machine learning (Apache Spark), and data management (Cloudera Manager).
Hortonworks vs. Cloudera
Money raised: Hortonworks $225 million; Cloudera $900 million ($740 million from a recent partnership with Intel)
Customers: Hortonworks added 250 customers in the past five quarters (big names include Spotify, eBay, Bloomberg and Samsung); Cloudera is estimated around the 350 mark (big names include Nokia, Mastercard, BT and eBay, which curiously appears on both companies’ customer lists)
Partners: Hortonworks lists around 300 on its website, including SAP, HP and Dell; Cloudera lists over 1,000, including HP, IBM and Intel
MapR is founded on the idea that the Apache Hadoop core is a beautiful thing that needs to grow up fast to have the most impact on the enterprise. MapR has added proprietary software to help manage the installation, configuration, and operation of its distribution. But MapR rejects open-source purity: Srivas has taken significant parts of Hadoop and re-implemented them in an API-compatible manner.
Hortonworks and Cloudera argue that this API-compatible approach means MapR isn’t open source. MapR argues back: Do you want read/write access to your file system? Do you want to handle lots of small files? Do you want production-quality NFS support so your other software can use the data in HDFS? Do you want better security that doesn’t require Kerberos? Do you want to run other software, like Vertica, on the machines in the Hadoop cluster?