Big data analytics involves collecting and analyzing large, complex datasets. There are three key aspects of big data: volume, referring to the large size of datasets; velocity, meaning the speed of data input and processing; and variety, the different data types including text, audio, video and more. Hadoop is an open-source framework that allows processing and querying vast amounts of data across clusters of computers. It uses HDFS for distributed storage and MapReduce as a processing paradigm to break work into parallelized chunks. R can be used with Hadoop for advanced analytics and visualization of large datasets stored in Hadoop.
2. What is Big Data
• Large and complex datasets
• Structured, semi-structured or unstructured
• Typically too large to fit in memory for
processing
• Distributed storage structure
• 3Vs of Big Data
– Velocity
– Volume
– Variety
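The "does not fit in memory" point above is the core constraint. A minimal single-machine sketch of the idea (streaming a file in fixed-size chunks so memory use stays bounded; the file name and chunk size are illustrative, not part of the slides):

```python
import os
import tempfile

def process_in_chunks(path, chunk_size=4):
    """Stream a file in fixed-size chunks so memory stays bounded,
    regardless of the file's total size."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)  # stand-in for real per-chunk work
    return total

# tiny demo file standing in for a dataset too big for RAM
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
print(process_in_chunks(tmp.name))  # 10 bytes processed, 4 at a time
os.unlink(tmp.name)
```

Hadoop applies the same principle, but distributes the chunks (HDFS blocks) across many machines instead of streaming them through one.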
3. Velocity
• Speed at which data arrives and must be processed (low-latency, real time)
• Examples
– Telephone call records
– Social media
– Retail sales
4. Volume
• Size of dataset
• KB, MB, GB, TB, PB
• Facebook
– 40 PB of data
– 100 TB/day
• Twitter
– 8 TB/day
• Yahoo
– 60 PB of data
• Big Data size varies from company to company
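The unit ladder and the Facebook figures above can be checked with a little arithmetic (the 1024x steps and the "days to accumulate" calculation are illustrative):

```python
# Unit ladder from the slide: each step up is 1024x
KB, MB, GB, TB, PB = (1024 ** i for i in range(1, 6))

facebook_total = 40 * PB   # "40 PB of data"
facebook_daily = 100 * TB  # "100 TB/day"

# Days of ingest needed to accumulate the stated total
print(facebook_total / facebook_daily)  # 409.6 days
```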
11. Organizing Data Services
• Distributed File System
• Serialization & Coordination
• ETL Tools
• Workflow
12. Big Data Applications
• Log Data Applications
– Splunk, Loggly
• Ad/Media Applications
– Bluefin, DataXu
• Marketing Applications
– Bloomreach, Myrrix
13. Apache Hadoop
• Open source framework for processing and
querying vast amounts of data on large
clusters of commodity hardware
• Enterprise-ready cloud computing technology
• Industry standard for Big Data
• Java based – but abstractions available for
various languages
• Concurrency, Scalability, Reliability
14. HDFS
• Hadoop Distributed File System
• File system to store large datasets
– Blocks of 64 MB instead of 4-32 KB
• Optimized for throughput over latency
• High availability through block replication
across nodes rather than hardware redundancy
• Optimized for read-many and write-once
• DataNode and NameNode
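The block-and-replication scheme above can be sketched with simple arithmetic (a minimal illustration; the 64 MB block size is from the slide, while the replication factor of 3 is HDFS's common default and an assumption here):

```python
BLOCK_SIZE = 64 * 2**20  # 64 MB blocks, per the slide
REPLICATION = 3          # common HDFS default (assumption, not from the slides)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file occupies; the last block may be partial."""
    return -(-file_size // block_size)  # ceiling division

file_size = 1 * 2**30  # a 1 GB file
blocks = split_into_blocks(file_size)
# 16 blocks, stored as 48 replicas spread across DataNodes;
# the NameNode only keeps the block-to-DataNode mapping
print(blocks, blocks * REPLICATION)
```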
15. MapReduce
• Data processing paradigm
• Map: defines how input data is transformed into key-value pairs
• Reduce: defines how mapped values are aggregated into output
• Works with arbitrarily large datasets
• Integrates tightly with HDFS
• Parallel processing
– Divide and conquer
• Key-value pairs instead of RDBMS schemas
• JobTracker and TaskTracker
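The Map and Reduce phases above can be sketched as a minimal in-memory word count (plain Python standing in for Hadoop's distributed runtime; the shuffle step is simulated with a sort-and-group, and all names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (key, value) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group pairs by key (Hadoop does this between the phases),
    # then Reduce: sum the values for each key
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big clusters", "big data"]
print(dict(reduce_phase(map_phase(lines))))
# {'big': 3, 'clusters': 1, 'data': 2}
```

In a real job, many mappers and reducers run in parallel on different HDFS blocks (divide and conquer), with the JobTracker scheduling work and TaskTrackers executing it.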
16. Other components
• Mahout – Machine learning
• Pig – High level language for interacting with
Hadoop
• Hive – Data warehousing
• HBase – Distributed, column-oriented DB
• Sqoop – SQL to Hadoop and vice versa
• Ambari – Web based Hadoop cluster
management
17. R + Hadoop
• Hadoop for data storage, computation power
• R for advanced analytics, visualization, data
loading
• Cloud based
• RHadoop
18. Data mining with R
• Regression
– lm
• Classification
– glm, ksvm, svm, randomForest, glmnet
• Clustering
– knn, kmeans, dist, pvclust, Mclust
• Recommendation
– recommenderlab