Big data analytics involves collecting and analyzing large, complex datasets. There are three key aspects of big data: volume, referring to the large size of datasets; velocity, meaning the speed of data input and processing; and variety, the different data types including text, audio, video and more. Hadoop is an open-source framework that allows processing and querying vast amounts of data across clusters of computers. It uses HDFS for distributed storage and MapReduce as a processing paradigm to break work into parallelized chunks. R can be used with Hadoop for advanced analytics and visualization of large datasets stored in Hadoop.
2. What is Big Data
• Large and complex datasets
• Structured, semi-structured or unstructured
• Typically too large to fit in memory for
processing
• Distributed storage structure
• 3Vs of Big Data
– Velocity
– Volume
– Variety
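The "does not fit in memory" point above is the core constraint. A minimal single-machine sketch of the idea (streaming a file in fixed-size chunks so memory use stays bounded; the file name and chunk size are illustrative, not part of the slides):

```python
import os
import tempfile

def process_in_chunks(path, chunk_size=4):
    """Stream a file in fixed-size chunks so memory stays bounded,
    regardless of the file's total size."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)  # stand-in for real per-chunk work
    return total

# tiny demo file standing in for a dataset too big for RAM
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
print(process_in_chunks(tmp.name))  # 10 bytes processed, 4 at a time
os.unlink(tmp.name)
```

Hadoop applies the same principle, but distributes the chunks (HDFS blocks) across many machines instead of streaming them through one.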
3. Velocity
• Speed at which data arrives and must be processed (low-latency, real time)
• Examples
– Telephone call records
– Social media
– Retail sales
4. Volume
• Size of dataset
• KB, MB, GB, TB, PB
• Facebook
– 40 PB of data
– 100 TB/day
• Twitter
– 8 TB/day
• Yahoo
– 60 PB of data
• Big Data size varies from company to company
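The unit ladder and the Facebook figures above can be checked with a little arithmetic (the 1024x steps and the "days to accumulate" calculation are illustrative):

```python
# Unit ladder from the slide: each step up is 1024x
KB, MB, GB, TB, PB = (1024 ** i for i in range(1, 6))

facebook_total = 40 * PB   # "40 PB of data"
facebook_daily = 100 * TB  # "100 TB/day"

# Days of ingest needed to accumulate the stated total
print(facebook_total / facebook_daily)  # 409.6 days
```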
11. Organizing Data Services
• Distributed File System
• Serialization & Coordination
• ETL Tools
• Workflow
12. Big Data Applications
• Log Data Applications
– Splunk, Loggly
• Ad/Media Applications
– Bluefin, DataXu
• Marketing Applications
– Bloomreach, Myrrix
13. Apache Hadoop
• Open source framework for processing and
querying vast amounts of data on large
clusters of commodity hardware
• Enterprise-ready cloud computing technology
• Industry standard for Big Data
• Java based – but abstractions available for
various languages
• Concurrency, Scalability, Reliability
14. HDFS
• Hadoop Distributed File System
• File system to store large datasets
– Blocks of 64 MB instead of 4-32 KB
• Optimized for throughput over latency
• High availability through block replication
across nodes rather than hardware redundancy
• Optimized for read-many and write-once
• DataNode and NameNode
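The block-and-replication scheme above can be sketched with simple arithmetic (a minimal illustration; the 64 MB block size is from the slide, while the replication factor of 3 is HDFS's common default and an assumption here):

```python
BLOCK_SIZE = 64 * 2**20  # 64 MB blocks, per the slide
REPLICATION = 3          # common HDFS default (assumption, not from the slides)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file occupies; the last block may be partial."""
    return -(-file_size // block_size)  # ceiling division

file_size = 1 * 2**30  # a 1 GB file
blocks = split_into_blocks(file_size)
# 16 blocks, stored as 48 replicas spread across DataNodes;
# the NameNode only keeps the block-to-DataNode mapping
print(blocks, blocks * REPLICATION)
```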
15. MapReduce
• Data processing paradigm
• Map: defines how input data is transformed into key-value pairs
• Reduce: defines how mapped values are aggregated into output
• Works with arbitrarily large datasets
• Integrates tightly with HDFS
• Parallel processing
– Divide and conquer
• Key-value pairs instead of RDBMS schemas
• JobTracker and TaskTracker
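The Map and Reduce phases above can be sketched as a minimal in-memory word count (plain Python standing in for Hadoop's distributed runtime; the shuffle step is simulated with a sort-and-group, and all names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (key, value) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group pairs by key (Hadoop does this between the phases),
    # then Reduce: sum the values for each key
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big clusters", "big data"]
print(dict(reduce_phase(map_phase(lines))))
# {'big': 3, 'clusters': 1, 'data': 2}
```

In a real job, many mappers and reducers run in parallel on different HDFS blocks (divide and conquer), with the JobTracker scheduling work and TaskTrackers executing it.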
16. Other components
• Mahout – Machine learning
• Pig – High level language for interacting with
Hadoop
• Hive – Data warehousing
• HBase – Distributed, column-oriented DB
• Sqoop – SQL to Hadoop and vice versa
• Ambari – Web based Hadoop cluster
management
17. R + Hadoop
• Hadoop for data storage, computation power
• R for advanced analytics, visualization, data
loading
• Cloud based
• RHadoop
18. Data mining with R
• Regression
– lm
• Classification
– glm, ksvm, svm, randomForest, glmnet
• Clustering
– knn, kmeans, dist, pvclust, Mclust
• Recommendation
– recommenderlab