Big Data, Big Dream
BIN ZHU
Who am I?
• 15 years of experience
• Telecom
• Enterprise ERP
• National Library
• Internet
• Java
• Database: DBA, designer, developer
• BI, Report, Data Mining
• Architect
• Team leader, Dev Manager, Product Owner
What is Big Data?
Big (volume), fast (velocity), varied (variety), uncertain (veracity)
What Big Data Can Do for Us
• Tell You What Is Likely to Happen
• Find Unexpected Relationships
• Monitor a Situation as it Develops
• Fix a Problem Before it Becomes a Crisis
Real Examples
• Google left China
• Input methods
• Google Ads
• Google Maps service
• Netflix movie recommendation
• Walmart shopping suggestions and customized flyers
• Credit card fraud detection
Big Data Architecture
Collecting → Queue → Processing → Data Store → Visualization
Collecting data
Queue
• A queue is an ordered list of elements of the same data type
• A queue is a FIFO (First In, First Out) structure
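The FIFO behavior described above can be sketched in a few lines. This is a minimal illustration using Python's standard `collections.deque`; the event names are made up for the example.

```python
from collections import deque

# A FIFO queue: items leave in the same order they arrived.
events = deque()

# Enqueue at the tail.
events.append("click")
events.append("scroll")
events.append("purchase")

# Dequeue from the head: the oldest items come out first.
first = events.popleft()
second = events.popleft()

print(first, second, list(events))
```

Whatever is still in `events` after the two `popleft()` calls is the most recently added item, which is the last one to be served.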
Why Queue?
• Decouples components and simplifies the system
• Enables asynchronous applications
• Makes the application more robust
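As a rough sketch of the decoupling and asynchrony listed above, the example below puts an in-process `queue.Queue` between a producer and a consumer thread. It only stands in for a real message broker; the function names and the sample events are assumptions made for illustration.

```python
import queue
import threading

def producer(q, items):
    """Push work onto the queue without knowing who will consume it."""
    for item in items:
        q.put(item)       # blocks only if the bounded queue is full
    q.put(None)           # sentinel: tells the consumer there is no more work

def consumer(q, results):
    """Pull work off the queue at its own pace."""
    while True:
        item = q.get()
        if item is None:  # sentinel received, stop
            break
        results.append(item.upper())

def run_pipeline(items):
    q = queue.Queue(maxsize=2)   # small buffer between the two sides
    results = []
    worker = threading.Thread(target=consumer, args=(q, results))
    worker.start()
    producer(q, items)
    worker.join()
    return results

processed = run_pipeline(["login", "search", "checkout"])
print(processed)
```

The producer and consumer share nothing except the queue, so either side could be replaced or scaled without touching the other.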
Popular Queues
Pros and Cons of Each Queue
Kafka architecture
Processing Engine
Cluster in HA mode (99.99% availability)
About 52 minutes of downtime per year
• Availability
• Reliability
• No Single Point of Failure
• Scalable
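The 52-minute figure for 99.99% availability follows from simple arithmetic. The sketch below shows the calculation, assuming a 365-day year; the function name is chosen for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability):
    """Allowed downtime per year for an availability given as a fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

four_nines = downtime_minutes_per_year(0.9999)
three_nines = downtime_minutes_per_year(0.999)

print(f"99.99%: {four_nines:.2f} min/year")
print(f"99.9%:  {three_nines:.2f} min/year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why removing single points of failure matters so much at this level.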
Hadoop Cluster
Hadoop HDFS (Hadoop Distributed File System)
Hadoop YARN (Cluster Resource Management)
How Does YARN Work?
Apache Storm
Apache Storm is a free and open-source distributed real-time computation system.
Use Case
• Real-time analytics
• Online machine learning
• Continuous computation
• Distributed RPC
• ETL
Main Features
• Scalable
• Fault-tolerant
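To give a feel for the continuous computation use case, here is a plain-Python sketch of a running word count. It does not use the Storm API: Storm topologies are built from spouts (stream sources) and bolts (processing steps), and the generator functions below only imitate that shape on a single machine. All names and the sample sentences are hypothetical.

```python
from collections import Counter

def sentence_spout():
    """Stand-in for a spout: emits a stream of sentences (a fixed sample here)."""
    for sentence in ["big data big dream", "big cluster"]:
        yield sentence

def split_bolt(sentences):
    """Stand-in for a bolt: splits each sentence into words."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    """Stand-in for a bolt: emits the running count after every word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]

# Wire the steps together; each item flows through as soon as it is produced.
running = list(count_bolt(split_bolt(sentence_spout())))
final_counts = dict(running)  # the last emitted count for each word wins

print(final_counts)
```

In a real cluster each step would run as many parallel tasks across machines, which is where the scalability and fault tolerance listed above come in.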
Data Store
Data Store – RDBMS
Data Store – NoSQL
RDBMS VS NoSQL
RDBMS Cluster
Master – Slave
Master – Master
Sharding
Separate DB
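Of the approaches listed above, sharding is the easiest to show in code. The sketch below routes each key to one of several databases by hashing it. The shard names and the `shard_for` helper are hypothetical, and real systems often use range-based or consistent-hashing schemes instead.

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical database names

def shard_for(key, shards=SHARDS):
    """Pick a shard for a key using a stable hash of the key."""
    # hashlib gives the same result in every process, unlike built-in hash().
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]

for user in ["user:1", "user:2", "user:42"]:
    print(user, "->", shard_for(user))
```

The same key always lands on the same shard, so reads find what writes stored. The trade-off of this simple modulo scheme is that changing the number of shards remaps most keys.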
NoSQL Cluster (Cassandra)
Real Case
ELK
ElasticSearch
Logstash
Kibana
Cache Redis
Cloud
What is behind the Cloud?
Use Cloud Services to build your Project Quickly
Cloud
Public Cloud? Private Cloud?
Real Job Description in Montreal
Company: Hortonworks
Position: SOLUTIONS ENGINEER
SKILLS
• You already know and love core Apache Hadoop (HDFS & YARN) and can talk to the benefits of a centralized
architecture for both data management and data access
• Solution Architecture/Engineering experience as a field of practice (able to listen to
customer requirements, whiteboard and propose solution architecture, and get hands-on with the tech
to design, build, and demonstrate real business value)
• Streaming experience (Nifi, Kafka, Storm, Spark, Solr, etc.)
• NoSQL experience (HBase, Cassandra, MongoDB, etc.) Maybe you can even debate one option over another?
• EDW experience – the Teradatas, Netezzas, GreenPlum/HAWQ, Exadatas of the world
• Integration Products experience (Tibco, MuleSoft, IBM, Oracle, Spring Integration, etc.)
• Start-up company experience
• Hortonworks Certified - Admin and/or Developer
http://hortonworks.com/careers/details/102293/
Questions

Big Data, Big Dream