Introduction To Big Data and Use Cases using Hadoop
1. Jongwook Woo
HiPIC
CSULA
ENC Lab
Hanyang University
Seoul, Korea
Aug 19th 2014
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles
Introduction To Big Data and
Use Cases using Hadoop
2. High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
Introduction
Emerging Big Data Technology
Big Data Use Cases
Hadoop 2.0
Training in Big Data
3. High Performance Information Computing Center
Jongwook Woo
CSULA
Me
Name: Jongwook Woo (우종욱)
Occupation:
Professor (rank: Associate Professor), California State University Los Angeles
– Capital City of Entertainment
Experience:
Professor since 2002: Computer Information Systems Dept, College of
Business and Economics
– www.calstatela.edu/faculty/jwoo5
Consulting for many companies, mostly around Hollywood, since 1998
– Mainly building eBusiness applications with J2EE middleware
– Information extraction and integration using the FAST, Lucene/Solr, and Sphinx search engines
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
Interested in Hadoop and Big Data since around 2009
4. High Performance Information Computing Center
Jongwook Woo
CSULA
Experience in Big Data
Grants
Received Microsoft Windows Azure Educator Grant (Oct 2013
- July 2014)
Received Amazon AWS in Education Research Grant (July
2012 - July 2014)
Received Amazon AWS in Education Coursework Grants (July
2012 - July 2013, Jan 2011 - Dec 2011)
Partnership
Academic Education Partnership with Cloudera since
June 2012
Connected with Hortonworks since May 2013
– Positive about providing a partnership
5. High Performance Information Computing Center
Jongwook Woo
CSULA
Experience in Big Data
Certificate
Certified Cloudera Instructor
Certified Cloudera Hadoop Developer / Administrator
Certificate of Achievement in the Big Data University Training
Course, “Hadoop Fundamentals I”, July 8 2012
Certificate of 10gen Training Course, “M101: MongoDB
Development”, (Dec 24 2012)
Blog and Github for Hadoop and its ecosystems
http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming,
RHadoop
https://github.com/dalgual
6. High Performance Information Computing Center
Jongwook Woo
CSULA
Experience in Big Data
Several publications regarding Hadoop and NoSQL
Deeksha Lakshmi, Iksuk Kim, Jongwook Woo, “Analysis of
MovieLens Data Set using Hive”, ARPN Journal of Science and
Technology, Dec 2013, Vol. 3, No. 12, pp. 1194-1198
“Scalable, Incremental Learning with MapReduce Parallelization for
Cell Detection in High-Resolution 3D Microscopy Data”. Chul Sung,
Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck
Choe. in Proceedings of the International Joint Conference on Neural
Networks, 2013
“Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las
Vegas (July 16-19, 2012)
“Market Basket Analysis Algorithm with no-SQL DB HBase and
Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho
Kim, EDB 2012, Incheon, Aug. 25-27, 2011
“Market Basket Analysis Algorithm with Map/Reduce of Cloud
Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las
Vegas (July 18-21, 2011)
Collaboration with Universities and companies
USC, Texas A&M, Cloudera, Amazon, Microsoft
7. High Performance Information Computing Center
Jongwook Woo
CSULA
What is Big Data, Map/Reduce, Hadoop, NoSQL DB on
Cloud Computing
8. High Performance Information Computing Center
Jongwook Woo
CSULA
Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”
9. High Performance Information Computing Center
Jongwook Woo
CSULA
New Data Trend
Sparsity
Unstructured
Schema free data with sparse attributes
– Semantic or social relations
No relational property
– nor complex join queries
• Log data
Immutable
No need to update and delete data
10. High Performance Information Computing Center
Jongwook Woo
CSULA
Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of web
– Sensor Data, Bioinformatics, Social Computing,
smart phone, online game…
Cannot be handled with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Inexpensive
11. High Performance Information Computing Center
Jongwook Woo
CSULA
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with multiple inexpensive
computers
• In effect, its own supercomputers
12. High Performance Information Computing Center
Jongwook Woo
CSULA
Hadoop 1.0
Hadoop
Doug Cutting
– Creator of Hadoop
– Founder of the Apache Lucene, Nutch, Avro, and
Hadoop projects
– Board member of the Apache Software Foundation
– Chief Architect at Cloudera
MapReduce
HDFS
Restricted Parallel Programming
– Not for iterative algorithms
– Not for graph processing
13. High Performance Information Computing Center
Jongwook Woo
CSULA
MapReduce in Detail
Functions borrowed from functional
programming languages (e.g., Lisp)
Provides Restricted parallel programming
model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of
EVERYTHING else
–Parallelization
–Fault Tolerance
–Data Distribution
–Load Balancing
14. High Performance Information Computing Center
Jongwook Woo
CSULA
Map
Convert input data to (key, value) pairs
map() functions run in parallel,
creating different intermediate (key, value)
values from different input data sets
15. High Performance Information Computing Center
Jongwook Woo
CSULA
Reduce
reduce() combines those intermediate values
into one or more final values for that same
key
reduce() functions also run in parallel,
each working on a different output key
Bottleneck:
reduce phase can’t start until map phase is
completely finished.
16. High Performance Information Computing Center
Jongwook Woo
CSULA
Example: Sort URLs by number of hits
Compute the most frequently hit URLs
Stored in log files
Map()
Input: <logFilename, file text>
Output: Parses the file and emits <url, hit count> pairs
– e.g. <http://hello.com, 1>
Reduce()
Input: <url, list of hit counts> from multiple map
nodes
Output: Sums all values for the same key and emits
<url, TotalCount>
– e.g. <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
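As a rough sketch of the Map() and Reduce() just described, the same URL hit count can be written for the Hadoop Streaming API (which appears again later in this talk) as two small Python scripts. This is only an illustration under the simplifying assumption that the URL is the first whitespace-separated field of each log line; it is not the exact job on this slide.

#!/usr/bin/env python
# mapper.py - emits one <url, 1> pair per log line.
# Assumes the URL is the first whitespace-separated field of the line.
import sys

for line in sys.stdin:
    fields = line.split()
    if fields:
        print("%s\t%d" % (fields[0], 1))

A matching reducer; Hadoop Streaming sorts the mapper output by key, so all counts for the same URL arrive on consecutive lines:

#!/usr/bin/env python
# reducer.py - sums the counts for each URL.
import sys

current_url, total = None, 0
for line in sys.stdin:
    url, count = line.rstrip("\n").split("\t")
    if url != current_url and current_url is not None:
        print("%s\t%d" % (current_url, total))
        total = 0
    current_url = url
    total += int(count)
if current_url is not None:
    print("%s\t%d" % (current_url, total))

The two scripts would be submitted with the standard Hadoop Streaming jar, passed as the -mapper and -reducer options together with -input and -output paths; the exact jar location depends on the distribution.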
17. High Performance Information Computing Center
Jongwook Woo
CSULA
Map/Reduce for URL visits
[Figure: Input log data is split across Map1() … Mapm(), each emitting partial <url, count> pairs such as (http://hi.com, 1), (http://hello.com, 3), (http://halo.com, 1), (http://hello.com, 5). The data aggregation/combine step groups the values per URL, e.g. (http://hi.com, <1, 1, …, 1>), (http://hello.com, <3, 5, 2, 7>), (http://halo.com, <1, 5>). Reduce1() … Reducel() then emit the totals: (http://hi.com, 32), (http://hello.com, 17), (http://halo.com, 6).]
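The aggregation/combine step shown in this figure can be mimicked locally in a few lines of plain Python; this only illustrates the grouping idea and is not Hadoop itself. The URLs and counts below are the ones from the figure.

from collections import defaultdict

# Intermediate <url, count> pairs as emitted by the mappers in the figure.
pairs = [("http://hello.com", 3), ("http://hello.com", 5),
         ("http://hello.com", 2), ("http://hello.com", 7),
         ("http://halo.com", 1), ("http://halo.com", 5)]

# Shuffle / aggregation: collect all values for the same key.
grouped = defaultdict(list)
for url, count in pairs:
    grouped[url].append(count)

# Reduce: sum the values per key.
for url, counts in grouped.items():
    print(url, sum(counts))   # http://hello.com 17, http://halo.com 6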
19. High Performance Information Computing Center
Jongwook Woo
CSULA
MapReduce 1.0 Cons and Future
Bad for
Fast response time
Large amount of shared data
Fine-grained synch needed
CPU-intensive not data-intensive
Continuous input stream
Hadoop 2.0: YARN product
23. High Performance Information Computing Center
Jongwook Woo
CSULA
Hadoop 2.0: YARN [2]
Data processing applications and services
Online Serving – HOYA (HBase on YARN)
Real-time event processing – Storm, S4, other commercial
platforms
Tez – Generic framework to run a complex DAG
MPI: OpenMPI, MPICH2
Master-Worker
Machine Learning: Spark
Graph processing: Giraph
Impala: Interactive SQL
Enabled by allowing the use of paradigm-specific application
masters
24. High Performance Information Computing Center
Jongwook Woo
CSULA
Josh Wills (Cloudera)
“I have found that many kinds of
scientists– such as astronomers,
geneticists, and geophysicists– are
working with very large data sets in order
to build models that do not involve
statistics or machine learning, and that
these scientists encounter data
challenges that would be familiar to data
scientists at Facebook, Twitter, and
LinkedIn.”
“Data science is a set of techniques used
by many scientists to solve problems
across a wide array of scientific fields.”
25. High Performance Information Computing Center
Jongwook Woo
CSULA
Legacy Example
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
11 million in all, dating back to 1851.
four-terabyte pile of images in TIFF format.
needed to translate that four-terabyte pile of TIFFs
into more web-friendly PDF files.
– not a particularly complicated, but very large, computing chore,
• requiring a whole lot of computer processing time.
26. High Performance Information Computing Center
Jongwook Woo
CSULA
Legacy Example (Cont’d)
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
a software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic
Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's
Simple Storage System (S3)
• In less than 24 hours, he had 11 million PDFs, all stored
neatly in S3 and ready to be served up to visitors to the
Times site.
The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
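The $240 figure is just the quoted rate multiplied out, e.g.:

# 10 cents per computer-hour x 100 computers x 24 hours
print(0.10 * 100 * 24)   # 240.0 (dollars)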
27. High Performance Information Computing Center
Jongwook Woo
CSULA
HuffPost | AOL
Two Machine Learning Use Cases
Comment Moderation
Evaluate All New HuffPost User Comments
Every Day
Identify Abusive / Aggressive Comments
Auto Delete / Publish ~25% of Comments Every
Day
Article Classification
Tag Articles for Advertising
E.g.: scary, salacious, …
28. High Performance Information Computing Center
Jongwook Woo
CSULA
Use Cases experienced
Log Analysis
Log files from IPS and IDS
– 1.5 GB per day for each system
Extracting unusual cases using Hadoop, Solr,
Flume on Cloudera
Customer Behavior Analysis
Market Basket Analysis Algorithm (see the sketch after this list)
Machine Learning for Image Processing
with Texas A&M
Hadoop Streaming API
Movie Data Analysis
Hive, Impala
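For the Market Basket Analysis item above, a very rough Hadoop Streaming sketch (an illustration of the general pair-counting idea only, not the algorithm published in the papers listed earlier) could emit every pair of items bought together in a transaction; a summing reducer like the one in the earlier URL example then counts how often each pair occurs.

#!/usr/bin/env python
# pair_mapper.py - illustrative only; assumes each input line is one
# transaction given as a comma-separated list of purchased items.
import sys
from itertools import combinations

for line in sys.stdin:
    items = sorted(set(line.strip().split(",")))
    for a, b in combinations(items, 2):
        # Emit each unordered item pair with a count of 1.
        print("%s,%s\t%d" % (a, b, 1))

Item pairs with high totals are the candidates for association rules such as "customers who bought a also bought b".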
29. High Performance Information Computing Center
Jongwook Woo
CSULA
Use Cases: Chip Design and
Semiconductor
Intel
Using Hadoop
– to gather historical information during manufacturing
• and combine new sources of information that had
previously been too unmanageable to use
– a small team of five people was able
• to slash $3 million off the cost of testing just one line
of Intel Core processors in 2012.
Intel IT expects to realize an additional $30 million in
cost avoidance in 2013-2014.
30. High Performance Information Computing Center
Jongwook Woo
CSULA
Use Cases: Chip Design and
Semiconductor
AMD
Using a Hadoop implementation,
able to cut the work of employees checking
semiconductor wafer quality data by 90 percent
– by catching faulty product batches earlier in
production.
Samsung
use of a Hadoop file system for handling its data
warehouse
– made analytics processing 10 times faster on 75
percent as much computing power,
• even as data sets grew 10 times larger
31. High Performance Information Computing Center
Jongwook Woo
CSULA
How to Adopt Hadoop and Its Ecosystems
Need R&D by University
University should own Hadoop Cluster
– It is an inexpensive super computer
Possible to have
– Big Data R&D, training, RFP
We can anticipate many application areas
– Semiconductor Data Analysis
– System Chip Design Data Analysis
32. High Performance Information Computing Center
Jongwook Woo
CSULA
How to Adopt Hadoop and Its Ecosystems
Training program
Self-study
– Takes time: more than a year to be an expert
– May not learn the details
– May miss many important topics
Cloudera
– $2,000, Hands-on Exercises
– About Hadoop, HBase, Hive/Pig, Data Analysis,
Spark, Data Mining etc
• Hadoop Developer
• Hadoop System Administrator
• Hadoop Data Analyst / Scientist
• Hadoop Spark
33. High Performance Information Computing Center
Jongwook Woo
CSULA
How to Adopt Hadoop and Its Ecosystems
Training program (Cont’d)
Educational Partnership with Cloudera
– One of initiators to launch Cloudera Academic
Programs
• Teach Big Data at CSULA
– http://www.cloudera.com/content/cloudera/en/our-customers/csula-academic-partnership.html
– Training people at Samsung and other small
companies in Korea using Cloudera’s material
34. High Performance Information Computing Center
Jongwook Woo
CSULA
Conclusion
Era of Big Data
Need to store and compute Big Data
Many solutions exist, but Hadoop stands out
Hadoop is a supercomputer that you
can own
Hadoop 2.0
Blue Ocean for Publications, RFP
Training is important
36. High Performance Information Computing Center
Jongwook Woo
CSULA
References
1. YARN: Apache Hadoop Next Generation Compute Platform, Bikas Saha
2. Apache Hadoop YARN: Enabling Next Generation,
http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex