Introduction To Big Data and Use Cases using Hadoop
1. Jongwook Woo
HiPIC
CSULA
ENC Lab
Hanyang University
Seoul, Korea
Aug 19th 2014
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles
Introduction To Big Data and
Use Cases using Hadoop
2. High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
Introduction
Emerging Big Data Technology
Big Data Use Cases
Hadoop 2.0
Training in Big Data
3. High Performance Information Computing Center
Jongwook Woo
CSULA
Me
Name: Jongwook Woo (우종욱)
Occupation:
Professor (rank: Associate Professor), California State University Los Angeles
– Capital City of Entertainment
Experience:
Professor since 2002: Computer Information Systems Dept, College of
Business and Economics
– www.calstatela.edu/faculty/jwoo5
Consulting for many companies, mostly around Hollywood, since 1998
– Mainly building eBusiness applications with J2EE middleware
– Information extraction and integration using the FAST, Lucene/Solr, and Sphinx search engines
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
Interested in Hadoop and Big Data since around 2009
4. High Performance Information Computing Center
Jongwook Woo
CSULA
Experience in Big Data
Grants
Received Microsoft Windows Azure Educator Grant (Oct 2013
- July 2014)
Received Amazon AWS in Education Research Grant (July
2012 - July 2014)
Received Amazon AWS in Education Coursework Grants (July
2012 - July 2013, Jan 2011 - Dec 2011)
Partnership
Academic Education Partnership with Cloudera since
June 2012
Connected with Hortonworks since May 2013
– Positive about providing a partnership
5. High Performance Information Computing Center
Jongwook Woo
CSULA
Experience in Big Data
Certificate
Certified Cloudera Instructor
Certified Cloudera Hadoop Developer / Administrator
Certificate of Achievement in the Big Data University Training
Course, “Hadoop Fundamentals I”, July 8 2012
Certificate of 10gen Training Course, “M101: MongoDB
Development”, (Dec 24 2012)
Blog and Github for Hadoop and its ecosystems
http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming,
RHadoop
https://github.com/dalgual
6. High Performance Information Computing Center
Jongwook Woo
CSULA
Experience in Big Data
Several publications regarding Hadoop and NoSQL
Deeksha Lakshmi, Iksuk Kim, Jongwook Woo, “Analysis of
MovieLens Data Set using Hive”, ARPN Journal of Science and
Technology, Dec 2013, Vol. 3, No. 12, pp. 1194-1198
“Scalable, Incremental Learning with MapReduce Parallelization for
Cell Detection in High-Resolution 3D Microscopy Data”. Chul Sung,
Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck
Choe. in Proceedings of the International Joint Conference on Neural
Networks, 2013
“Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las
Vegas (July 16-19, 2012)
“Market Basket Analysis Algorithm with no-SQL DB HBase and
Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho
Kim, EDB 2012, Incheon, Aug. 25-27, 2011
“Market Basket Analysis Algorithm with Map/Reduce of Cloud
Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las
Vegas (July 18-21, 2011)
Collaboration with Universities and companies
USC, Texas A&M, Cloudera, Amazon, Microsoft
7. High Performance Information Computing Center
Jongwook Woo
CSULA
What is Big Data, Map/Reduce, Hadoop, NoSQL DB on
Cloud Computing
8. High Performance Information Computing Center
Jongwook Woo
CSULA
Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”
9. High Performance Information Computing Center
Jongwook Woo
CSULA
New Data Trend
Sparsity
Unstructured
Schema free data with sparse attributes
– Semantic or social relations
No relational property
– nor complex join queries
• Log data
Immutable
No need to update and delete data
10. High Performance Information Computing Center
Jongwook Woo
CSULA
Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of web
– Sensor Data, Bioinformatics, Social Computing,
smart phone, online game…
Cannot be handled with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Inexpensive
11. High Performance Information Computing Center
Jongwook Woo
CSULA
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with multiple inexpensive
computers
• In effect, its own supercomputers
12. High Performance Information Computing Center
Jongwook Woo
CSULA
Hadoop 1.0
Hadoop
Doug Cutting
– Creator of Hadoop
– Founder of the Apache Lucene, Nutch, Avro, and
Hadoop projects
– Board member of the Apache Software Foundation
– Chief Architect at Cloudera
MapReduce
HDFS
Restricted Parallel Programming
– Not for iterative algorithms
– Not for graph processing
13. High Performance Information Computing Center
Jongwook Woo
CSULA
MapReduce in Detail
Functions borrowed from functional
programming languages (e.g., Lisp)
Provides Restricted parallel programming
model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of
EVERYTHING else
–Parallelization
–Fault Tolerance
–Data Distribution
–Load Balancing
14. High Performance Information Computing Center
Jongwook Woo
CSULA
Map
Convert input data to (key, value) pairs
map() functions run in parallel,
creating different intermediate (key, value)
values from different input data sets
15. High Performance Information Computing Center
Jongwook Woo
CSULA
Reduce
reduce() combines those intermediate values
into one or more final values for that same
key
reduce() functions also run in parallel,
each working on a different output key
Bottleneck:
reduce phase can’t start until map phase is
completely finished.
16. High Performance Information Computing Center
Jongwook Woo
CSULA
Example: Sort URLs by number of hits
Compute the most frequently hit URLs
Stored in log files
Map()
Input: <logFilename, file text>
Output: Parses the file and emits <url, hit count> pairs
– e.g. <http://hello.com, 1>
Reduce()
Input: <url, list of hit counts> from multiple map
nodes
Output: Sums all values for the same key and emits
<url, TotalCount>
– e.g. <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
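As a rough sketch of the Map() and Reduce() just described, the same URL hit count can be written for the Hadoop Streaming API (which appears again later in this talk) as two small Python scripts. This is only an illustration under the simplifying assumption that the URL is the first whitespace-separated field of each log line; it is not the exact job on this slide.

#!/usr/bin/env python
# mapper.py - emits one <url, 1> pair per log line.
# Assumes the URL is the first whitespace-separated field of the line.
import sys

for line in sys.stdin:
    fields = line.split()
    if fields:
        print("%s\t%d" % (fields[0], 1))

A matching reducer; Hadoop Streaming sorts the mapper output by key, so all counts for the same URL arrive on consecutive lines:

#!/usr/bin/env python
# reducer.py - sums the counts for each URL.
import sys

current_url, total = None, 0
for line in sys.stdin:
    url, count = line.rstrip("\n").split("\t")
    if url != current_url and current_url is not None:
        print("%s\t%d" % (current_url, total))
        total = 0
    current_url = url
    total += int(count)
if current_url is not None:
    print("%s\t%d" % (current_url, total))

The two scripts would be submitted with the standard Hadoop Streaming jar, passed as the -mapper and -reducer options together with -input and -output paths; the exact jar location depends on the distribution.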
17. High Performance Information Computing Center
Jongwook Woo
CSULA
Map/Reduce for URL visits
[Figure: Input log data is split across Map1() … Mapm(), each emitting partial <url, count> pairs such as (http://hi.com, 1), (http://hello.com, 3), (http://halo.com, 1), (http://hello.com, 5). The data aggregation/combine step groups the values per URL, e.g. (http://hi.com, <1, 1, …, 1>), (http://hello.com, <3, 5, 2, 7>), (http://halo.com, <1, 5>). Reduce1() … Reducel() then emit the totals: (http://hi.com, 32), (http://hello.com, 17), (http://halo.com, 6).]
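The aggregation/combine step shown in this figure can be mimicked locally in a few lines of plain Python; this only illustrates the grouping idea and is not Hadoop itself. The URLs and counts below are the ones from the figure.

from collections import defaultdict

# Intermediate <url, count> pairs as emitted by the mappers in the figure.
pairs = [("http://hello.com", 3), ("http://hello.com", 5),
         ("http://hello.com", 2), ("http://hello.com", 7),
         ("http://halo.com", 1), ("http://halo.com", 5)]

# Shuffle / aggregation: collect all values for the same key.
grouped = defaultdict(list)
for url, count in pairs:
    grouped[url].append(count)

# Reduce: sum the values per key.
for url, counts in grouped.items():
    print(url, sum(counts))   # http://hello.com 17, http://halo.com 6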
19. High Performance Information Computing Center
Jongwook Woo
CSULA
MapReduce 1.0 Cons and Future
Bad for
Fast response time
Large amount of shared data
Fine-grained synch needed
CPU-intensive not data-intensive
Continuous input stream
Hadoop 2.0: YARN product
23. High Performance Information Computing Center
Jongwook Woo
CSULA
Hadoop 2.0: YARN [2]
Data processing applications and services
Online Serving – HOYA (HBase on YARN)
Real-time event processing – Storm, S4, other commercial
platforms
Tez – Generic framework to run a complex DAG
MPI: OpenMPI, MPICH2
Master-Worker
Machine Learning: Spark
Graph processing: Giraph
Impala: Interactive SQL
Enabled by allowing the use of paradigm-specific application
masters
24. High Performance Information Computing Center
Jongwook Woo
CSULA
Josh Wills (Cloudera)
“I have found that many kinds of
scientists– such as astronomers,
geneticists, and geophysicists– are
working with very large data sets in order
to build models that do not involve
statistics or machine learning, and that
these scientists encounter data
challenges that would be familiar to data
scientists at Facebook, Twitter, and
LinkedIn.”
“Data science is a set of techniques used
by many scientists to solve problems
across a wide array of scientific fields.”
25. High Performance Information Computing Center
Jongwook Woo
CSULA
Legacy Example
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
11 million in all, dating back to 1851.
four-terabyte pile of images in TIFF format.
needed to translate that four-terabyte pile of TIFFs
into more web-friendly PDF files.
– not a particularly complicated, but very large, computing chore,
• requiring a whole lot of computer processing time.
26. High Performance Information Computing Center
Jongwook Woo
CSULA
Legacy Example (Cont’d)
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
a software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic
Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's
Simple Storage System (S3)
• In less than 24 hours, he had 11 million PDFs, all stored
neatly in S3 and ready to be served up to visitors to the
Times site.
The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
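The $240 figure is just the quoted rate multiplied out, e.g.:

# 10 cents per computer-hour x 100 computers x 24 hours
print(0.10 * 100 * 24)   # 240.0 (dollars)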
27. High Performance Information Computing Center
Jongwook Woo
CSULA
HuffPost | AOL
Two Machine Learning Use Cases
Comment Moderation
Evaluate All New HuffPost User Comments
Every Day
Identify Abusive / Aggressive Comments
Auto Delete / Publish ~25% of Comments Every
Day
Article Classification
Tag Articles for Advertising
E.g.: scary, salacious, …
28. High Performance Information Computing Center
Jongwook Woo
CSULA
Use Cases experienced
Log Analysis
Log files from IPS and IDS
– 1.5 GB per day for each system
Extracting unusual cases using Hadoop, Solr,
Flume on Cloudera
Customer Behavior Analysis
Market Basket Analysis Algorithm (see the sketch after this list)
Machine Learning for Image Processing
with Texas A&M
Hadoop Streaming API
Movie Data Analysis
Hive, Impala
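For the Market Basket Analysis item above, a very rough Hadoop Streaming sketch (an illustration of the general pair-counting idea only, not the algorithm published in the papers listed earlier) could emit every pair of items bought together in a transaction; a summing reducer like the one in the earlier URL example then counts how often each pair occurs.

#!/usr/bin/env python
# pair_mapper.py - illustrative only; assumes each input line is one
# transaction given as a comma-separated list of purchased items.
import sys
from itertools import combinations

for line in sys.stdin:
    items = sorted(set(line.strip().split(",")))
    for a, b in combinations(items, 2):
        # Emit each unordered item pair with a count of 1.
        print("%s,%s\t%d" % (a, b, 1))

Item pairs with high totals are the candidates for association rules such as "customers who bought a also bought b".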
29. High Performance Information Computing Center
Jongwook Woo
CSULA
Use Cases: Chip Design and
Semiconductor
Intel
Using Hadoop
– to gather historical information during manufacturing
• and combine new sources of information that had
previously been too unmanageable to use
– a small team of five people was able
• to slash $3 million off the cost of testing just one line
of Intel Core processors in 2012.
Intel IT expects to realize an additional $30 million in
cost avoidance in 2013-2014.
30. High Performance Information Computing Center
Jongwook Woo
CSULA
Use Cases: Chip Design and
Semiconductor
AMD
Using a Hadoop implementation,
able to cut the work of employees checking
semiconductor wafer quality data by 90 percent
– by catching faulty product batches earlier in
production.
Samsung
use of a Hadoop file system for handling its data
warehouse
– made analytics processing 10 times faster on 75
percent as much computing power,
• even as data sets grew 10 times larger
31. High Performance Information Computing Center
Jongwook Woo
CSULA
How to Adopt Hadoop and Its Ecosystems
Need R&D by University
University should own Hadoop Cluster
– It is an inexpensive super computer
Possible to have
– Big Data R&D, training, RFP
We can anticipate many application areas
– Semiconductor Data Analysis
– System Chip Design Data Analysis
32. High Performance Information Computing Center
Jongwook Woo
CSULA
How to Adopt Hadoop and Its Ecosystems
Training program
Self-study
– Takes time: more than a year to be an expert
– May not learn the details
– May miss many important topics
Cloudera
– $2,000, Hands-on Exercises
– About Hadoop, HBase, Hive/Pig, Data Analysis,
Spark, Data Mining etc
• Hadoop Developer
• Hadoop System Administrator
• Hadoop Data Analyst / Scientist
• Hadoop Spark
33. High Performance Information Computing Center
Jongwook Woo
CSULA
How to Adopt Hadoop and Its Ecosystems
Training program (Cont’d)
Educational Partnership with Cloudera
– One of initiators to launch Cloudera Academic
Programs
• Teach Big Data at CSULA
– http://www.cloudera.com/content/cloudera/en/our-customers/csula-academic-partnership.html
– Training people at Samsung and other small
companies in Korea using Cloudera’s material
34. High Performance Information Computing Center
Jongwook Woo
CSULA
Conclusion
Era of Big Data
Need to store and compute Big Data
Many solutions exist, but Hadoop stands out
Hadoop is a supercomputer that you
can own
Hadoop 2.0
Blue Ocean for Publications, RFP
Training is important
36. High Performance Information Computing Center
Jongwook Woo
CSULA
References
1. YARN: Apache Hadoop Next Generation Compute Platform, Bikas Saha
2. Apache Hadoop YARN: Enabling Next Generation,
http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex