SlideShare a Scribd company logo
1 of 36
Download to read offline
Jongwook Woo
HiPIC
CSULA
ENC Lab
Hanyang University
Seoul, Korea
Aug 19th 2014
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles
Introduction To Big Data and
Use Cases using Hadoop
High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
소개
 Emerging Big Data Technology
 Big Data Use Cases
 Hadoop 2.0
 Training in Big Data
High Performance Information Computing Center
Jongwook Woo
CSULA
Me
 이름: 우종욱
 직업:
 교수 (직책: 부교수), California State University Los Angeles
– Capital City of Entertainment
 경력:
 2002년 부터 교수: Computer Information Systems Dept, College of
Business and Economics
– www.calstatela.edu/faculty/jwoo5
 1998년부터 헐리우드등지의 많은 회사 컨설팅
– 주로 J2EE 미들웨어를 이용한 eBusiness applications 구축
– FAST, Lucene/Solr, Sphinx 검색엔진을 이용한 정보추출, 정보통합
– Warner Bros (Matrix online game), E!, citysearch.com, ARM 등
 2009여년 부터 하둡 빅데이타에 관심
High Performance Information Computing Center
Jongwook Woo
CSULA
Experience in Big Data
 Grants
 Received MicroSoft Windows Azure Educator Grant (Oct 2013
- July 2014)
 Received Amazon AWS in Education Research Grant (July
2012 - July 2014)
 Received Amazon AWS in Education Coursework Grants (July
2012 - July 2013, Jan 2011 - Dec 2011
 Partnership
 Received Academic Education Partnership with Cloudera since
June 2012
 Linked with Hortonworks since May 2013
– Positive to provide partnership
High Performance Information Computing Center
Jongwook Woo
CSULA
Experience in Big Data
 Certificate
 Certified Cloudera Instructor
 Certified Cloudera Hadoop Developer / Administrator
 Certificate of Achievement in the Big Data University Training
Course, “Hadoop Fundamentals I”, July 8 2012
 Certificate of 10gen Training Course, “M101: MongoDB
Development”, (Dec 24 2012)
 Blog and Github for Hadoop and its ecosystems
 http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
 https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming,
RHadoop
 https://github.com/dalgual
High Performance Information Computing Center
Jongwook Woo
CSULA
Experience in Big Data
 Several publications regarding Hadoop and NoSQL
 Deeksha Lakshmi, Iksuk Kim, Jongwook Woo, “Analysis of
MovieLens Data Set using Hive”, in Journal of Science and
Technology, Dec 2013, Vol3 no12, pp1194-1198, ARPN
 “Scalable, Incremental Learning with MapReduce Parallelization for
Cell Detection in High-Resolution 3D Microscopy Data”. Chul Sung,
Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck
Choe. in Proceedings of the International Joint Conference on Neural
Networks, 2013
 “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las
Vegas (July 16-19, 2012)
 “Market Basket Analysis Algorithm with no-SQL DB HBase and
Hadoop”,Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho
Kim, EDB 2012, Incheon, Aug. 25-27, 2011
 “Market Basket Analysis Algorithm with Map/Reduce of Cloud
Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las
Vegas (July 18-21, 2011)
 Collaboration with Universities and companies
 USC, Texas A&M, Cloudera, Amazon, MicroSoft
High Performance Information Computing Center
Jongwook Woo
CSULA
What is Big Data, Map/Reduce, Hadoop, NoSQL DB on
Cloud Computing
High Performance Information Computing Center
Jongwook Woo
CSULA
Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”
High Performance Information Computing Center
Jongwook Woo
CSULA
New Data Trend
Sparsity
Unstructured
Schema free data with sparse attributes
– Semantic or social relations
No relational property
– nor complex join queries
• Log data
Immutable
No need to update and delete data
High Performance Information Computing Center
Jongwook Woo
CSULA
Data Issues
Large-Scale data
Tera-Byte (1012), Peta-byte (1015)
– Because of web
– Sensor Data, Bioinformatics, Social Computing,
smart phone, online game…
Cannot handle with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Non-expensive
High Performance Information Computing Center
Jongwook Woo
CSULA
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with multiple non-expensive
computers
• Own super computers
High Performance Information Computing Center
Jongwook Woo
CSULA
Hadoop 1.0
Hadoop
Doug Cutting
– 하둡 창시자
– 아파치 Lucene, Nutch, Avro, 하둡 프로젝트의
창시자
– 아파치 소프트웨어 파운데이션의 보드 멤버
– Chief Architect at Cloudera
MapReduce
HDFS
Restricted Parallel Programming
– Not for iterative algorithms
– Not for graph
High Performance Information Computing Center
Jongwook Woo
CSULA
MapReduce in Detail
Functions borrowed from functional
programming languages (eg. Lisp)
Provides Restricted parallel programming
model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of
EVERYTHING else
–Parallelization
–Fault Tolerance
–Data Distribution
–Load Balancing
High Performance Information Computing Center
Jongwook Woo
CSULA
Map
Convert input data to (key, value) pairs
map() functions run in parallel,
 creating different intermediate (key, value)
values from different input data sets
High Performance Information Computing Center
Jongwook Woo
CSULA
Reduce
reduce() combines those intermediate values
into one or more final values for that same
key
reduce() functions also run in parallel,
each working on a different output key
Bottleneck:
reduce phase can’t start until map phase is
completely finished.
High Performance Information Computing Center
Jongwook Woo
CSULA
Example: Sort URLs in the largest hit order
Compute the largest hit URLs
Stored in log files
Map()
Input <logFilename, file text>
Output: Parses file and emits <url, hit counts> pairs
– eg. <http://hello.com, 1>
Reduce()
Input: <url, list of hit counts> from multiple map
nodes
Output: Sums all values for the same key and emits
<url, TotalCount>
– eg.<http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
High Performance Information Computing Center
Jongwook Woo
CSULA
Map/Reduce for URL visits
…
…Map1() Map2() Mapm()
Reduce1 () Reducel()
Data Aggregation/Combine
(http://hi.com, <1, 1, …, 1>)
(http://hello.com, <3, 5, 2, 7>)
(http://hi.com, 32)
(http://hello.com, 17)
Input Log Data
Reduce2()
(http://hi.com, 1)
(http://hello.com, 3)
…
(http://halo.com, 1)
(http://hello.com, 5)
…
(http://halo.com, <1, 5,>)
(http://halo.com, 6)
High Performance Information Computing Center
Jongwook Woo
CSULA© Hortonworks Inc. 2013 - Confidential
Hadoop 1 Architecture [1]
JobTracker
Manage Cluster Resources & Job Scheduling
TaskTracker
Per-node agent
Manage Tasks
Page 18
High Performance Information Computing Center
Jongwook Woo
CSULA
MapReduce 1.0 Cons and Future
Bad for
 Fast response time
 Large amount of shared data
 Fine-grained synch needed
 CPU-intensive not data-intensive
 Continuous input stream
Hadoop 2.0: YARN
product
High Performance Information Computing Center
Jongwook Woo
CSULA© Hortonworks Inc. 2013 - Confidential
Hadoop 1 Limitations [1]
Lacks Support for Alternate Paradigms and Services
Force everything needs to look like Map Reduce
Iterative applications in MapReduce are 10x slower
Scalability
Max Cluster size ~5,000 nodes
Max concurrent tasks ~40,000
Availability
Failure Kills Queued & Running Jobs
Hard partition of resources into map and reduce slots
Non-optimal Resource Utilization
Page 20
High Performance Information Computing Center
Jongwook Woo
CSULA
Hadoop as Next-Gen Platform [1]
HADOOP 1.0
HDFS
(redundant, reliable storage)
MapReduce
(cluster resource management
& data processing)
HDFS2
(redundant, highly-available & reliable storage)
YARN
(cluster resource management)
MapReduce
(data processing)
Others
HADOOP 2.0
Single Use System
Batch Apps
Multi Purpose Platform
Batch, Interactive, Online, Streaming, …
Page 21
High Performance Information Computing Center
Jongwook Woo
CSULA© Hortonworks Inc. 2013 - Confidential
Page 22
Hadoop 2 - YARN Architecture
ResourceManager (RM)
NodeManager (NM)Per-Node
ApplicationMaster (AM)
Per-Application –
Manages application
lifecycle and task
scheduling
Resource
Manager
MapReduce Status
Job Submission
Client
Node
Manager
Node
Manager
Container
Node
Manager
App Mstr
Node Status
Resource Request
High Performance Information Computing Center
Jongwook Woo
CSULA
Hadoop 2.0: YARN [2]
 Data processing applications and services
 Online Serving – HOYA (HBase on YARN)
 Real-time event processing – Storm, S4, other commercial
platforms
 Tez – Generic framework to run a complex DAG
 MPI: OpenMPI, MPICH2
 Master-Worker
 Machine Learning: Spark
 Graph processing: Giraph
 Impala: Interactive SQL
 Enabled by allowing the use of paradigm-specific application
master
High Performance Information Computing Center
Jongwook Woo
CSULA
Josh Wills (Cloudera)
 “I have found that many kinds of
scientists– such as astronomers,
geneticists, and geophysicists– are
working with very large data sets in order
to build models that do not involve
statistics or machine learning, and that
these scientists encounter data
challenges that would be familiar to data
scientists at Facebook, Twitter, and
LinkedIn.”
 “Data science is a set of techniques used
by many scientists to solve problems
across a wide array of scientific fields.”
High Performance Information Computing Center
Jongwook Woo
CSULA
Legacy Example
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
11 million in all, dating back to 1851.
four-terabyte pile of images in TIFF format.
needed to translate that four-terabyte pile of TIFFs
into more web-friendly PDF files.
– not a particularly complicated but large computing chore,
• requiring a whole lot of computer processing time.
High Performance Information Computing Center
Jongwook Woo
CSULA
Legacy Example (Cont’d)
In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
a software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic
Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's
Simple Storage System (S3)
• In less than 24 hours, 11 millions PDFs, all stored
neatly in S3 and ready to be served up to visitors to the
Times site.
 The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
High Performance Information Computing Center
Jongwook Woo
CSULA
HuffPost | AOL
Two Machine Learning Use Cases
Comment Moderation
 Evaluate All New HuffPost User Comments
Every Day
 Identify Abusive / Aggressive Comments
 Auto Delete / Publish ~25% Comments Every
Day
Article Classification
 Tag Articles for Advertising
 E.g.: scary, salacious, …
High Performance Information Computing Center
Jongwook Woo
CSULA
Use Cases experienced
Log Analysis
 Log files from IPS and IDS
– 1.5GB per day for each systems
 Extracting unusual cases using Hadoop, Solr,
Flume on Cloudera
Customer Behavior Analysis
Market Basket Analysis Algorithm
 Machine Learning for Image Processing
with Texas A&M
Hadoop Streaming API
 Movie Data Analysis
 Hive, Impala
High Performance Information Computing Center
Jongwook Woo
CSULA
Use Cases: Chip Design and
Seminconductor
Intel
Using Hadoop
– to gather historical information during manufacturing
• and combine new sources of information that had
previously been too unmanageable to use
– a small team of five people was able
• to slash $3 million off the cost of testing just one line
of Intel Core processors in 2012.
Intel IT expects to realize an additional $30 million in
cost avoidance in 2013-2014.
High Performance Information Computing Center
Jongwook Woo
CSULA
Use Cases: Chip Design and
Seminconductor
AMD
Using a Hadoop implementation,
able to cut the work of employees checking
semiconductor wafer quality data by 90 percent
– by catching faulty product batches earlier in
production.
Samsung
use of a Hadoop file system for handling its data
warehouse
– made analytics processing 10 times faster on 75
percent as much computing power,
• even as data sets grew 10 times larger
High Performance Information Computing Center
Jongwook Woo
CSULA
How for Hadoop and Ecosystems
Need R&D by University
University should own Hadoop Cluster
– It is an inexpensive super computer
Possible to have
– Big Data R&D, training, RFP
We may predict many
– Semiconductor Data Analysis
– System Chip Design Data Analysis
High Performance Information Computing Center
Jongwook Woo
CSULA
How for Hadoop and Ecosystems
Training program
Self-study
– Takes time: more than a year to be an expert
– Don’t know the detail
– Miss many important topics
Cloudera
– $2,000, Hands-on Exercises
– About Hadoop, Hbase, Hive/Pig, Data Analysis,
Spark, Data Mining etc
• 하둡 개발자
• 하둡 시스템관리자
• 하둡 데이터 분석가/과학자
• 하둡 Spark
High Performance Information Computing Center
Jongwook Woo
CSULA
How for Hadoop and Ecosystems
Training program (Cont’d)
Educational Partnership with Cloudera
– One of initiators to launch Cloudera Academic
Programs
• Teach Big Data at CSULA
– http://www.cloudera.com/content/cloudera/en/our-
customers/csula-academic-partnership.html
– Training ppl at Samsung, Other Small companies
in Korea using Cloudera’s material
High Performance Information Computing Center
Jongwook Woo
CSULA
Conclusion
Era of Big Data
Need to store and compute Big Data
Many solutions but Hadoop
Hadoop is supercomputer that you
can own
Hadoop 2.0
Blue Ocean for Publications, RFP
Training is important
High Performance Information Computing Center
Jongwook Woo
CSULA
Question?
High Performance Information Computing Center
Jongwook Woo
CSULA
References
1. YARN Apache Hadoop Next Generation
Compute Platform, Bikas Saha
2. Apache Hadoop Yarn Enabling Next
Generation
1. http://www.slideshare.net/hortonworks/apache-
hadoop-yarn-enabling-nex

More Related Content

What's hot

Recent IT Development and Women: Big Data and The Power of Women in Goryeo
 Recent IT Development and Women: Big Data and The Power of Women in Goryeo Recent IT Development and Women: Big Data and The Power of Women in Goryeo
Recent IT Development and Women: Big Data and The Power of Women in GoryeoJongwook Woo
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduceRyan Tabora
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingJongwook Woo
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveYahoo Developer Network
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopEdureka!
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017Jongwook Woo
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project GuidanceVarad Meru
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 
Understanding Big Data And Hadoop
Understanding Big Data And HadoopUnderstanding Big Data And Hadoop
Understanding Big Data And HadoopEdureka!
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Mahantesh Angadi
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP vinoth kumar
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 

What's hot (20)

Recent IT Development and Women: Big Data and The Power of Women in Goryeo
 Recent IT Development and Women: Big Data and The Power of Women in Goryeo Recent IT Development and Women: Big Data and The Power of Women in Goryeo
Recent IT Development and Women: Big Data and The Power of Women in Goryeo
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and Training
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop technology doc
Hadoop technology docHadoop technology doc
Hadoop technology doc
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project Guidance
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
Understanding Big Data And Hadoop
Understanding Big Data And HadoopUnderstanding Big Data And Hadoop
Understanding Big Data And Hadoop
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 

Viewers also liked

Big data using Public Cloud
Big data using Public CloudBig data using Public Cloud
Big data using Public CloudIMC Institute
 
บทความ การสำรวจตลาด Thai Software & Software Services 2558
บทความ การสำรวจตลาด Thai Software & Software Services 2558 บทความ การสำรวจตลาด Thai Software & Software Services 2558
บทความ การสำรวจตลาด Thai Software & Software Services 2558 IMC Institute
 
Technology Trends ผลกระต่อธุรกิจการธนาคาร
Technology Trends ผลกระต่อธุรกิจการธนาคารTechnology Trends ผลกระต่อธุรกิจการธนาคาร
Technology Trends ผลกระต่อธุรกิจการธนาคารIMC Institute
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and SparkJongwook Woo
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and SparkJongwook Woo
 
Big data project management
Big data project managementBig data project management
Big data project managementIMC Institute
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataIMC Institute
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLBig Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLJongwook Woo
 
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and HiveAnalyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and HiveIMC Institute
 
Introduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use CasesIntroduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use CasesJongwook Woo
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibIMC Institute
 
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษา
เทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษาเทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษา
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษาIMC Institute
 
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Data Con LA
 
Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015IMC Institute
 
Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2IMC Institute
 
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016IMC Institute
 
Big Data as a Service
Big Data as a ServiceBig Data as a Service
Big Data as a ServiceIMC Institute
 
Apache Spark in Action
Apache Spark in ActionApache Spark in Action
Apache Spark in ActionIMC Institute
 

Viewers also liked (20)

Big data using Public Cloud
Big data using Public CloudBig data using Public Cloud
Big data using Public Cloud
 
บทความ การสำรวจตลาด Thai Software & Software Services 2558
บทความ การสำรวจตลาด Thai Software & Software Services 2558 บทความ การสำรวจตลาด Thai Software & Software Services 2558
บทความ การสำรวจตลาด Thai Software & Software Services 2558
 
ITSS Overview
ITSS OverviewITSS Overview
ITSS Overview
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Technology Trends ผลกระต่อธุรกิจการธนาคาร
Technology Trends ผลกระต่อธุรกิจการธนาคารTechnology Trends ผลกระต่อธุรกิจการธนาคาร
Technology Trends ผลกระต่อธุรกิจการธนาคาร
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Big data project management
Big data project managementBig data project management
Big data project management
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLBig Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure ML
 
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and HiveAnalyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
 
Introduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use CasesIntroduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use Cases
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlib
 
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษา
เทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษาเทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษา
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษา
 
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
 
Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015
 
Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2
 
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
 
Big Data as a Service
Big Data as a ServiceBig Data as a Service
Big Data as a Service
 
Apache Spark in Action
Apache Spark in ActionApache Spark in Action
Apache Spark in Action
 

Similar to Introduction To Big Data and Use Cases using Hadoop

Big Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive ComputingBig Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive ComputingJongwook Woo
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open PlatformJongwook Woo
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingJongwook Woo
 
Big Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesBig Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesJongwook Woo
 
Big Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksBig Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksJongwook Woo
 
Chek mate geolocation analyzer
Chek mate geolocation analyzerChek mate geolocation analyzer
Chek mate geolocation analyzerpriyal mistry
 
Introduction to Big Data: Smart Factory
Introduction to Big Data: Smart FactoryIntroduction to Big Data: Smart Factory
Introduction to Big Data: Smart FactoryJongwook Woo
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with SparkArjen de Vries
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
Distributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using MLDistributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using MLJorge Cardoso
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkIRJET Journal
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
Predictive Analysis of Financial Fraud Detection using Azure and Spark MLPredictive Analysis of Financial Fraud Detection using Azure and Spark ML
Predictive Analysis of Financial Fraud Detection using Azure and Spark MLJongwook Woo
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediFundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediAnimesh Chaturvedi
 

Similar to Introduction To Big Data and Use Cases using Hadoop (20)

Big Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive ComputingBig Data and Advanced Data Intensive Computing
Big Data and Advanced Data Intensive Computing
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open Platform
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and Training
 
Big Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesBig Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use Cases
 
AI on Big Data
AI on Big DataAI on Big Data
AI on Big Data
 
Big Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksBig Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on Networks
 
Chek mate geolocation analyzer
Chek mate geolocation analyzerChek mate geolocation analyzer
Chek mate geolocation analyzer
 
Spark ukc2015v1.1
Spark ukc2015v1.1Spark ukc2015v1.1
Spark ukc2015v1.1
 
Introduction to Big Data: Smart Factory
Introduction to Big Data: Smart FactoryIntroduction to Big Data: Smart Factory
Introduction to Big Data: Smart Factory
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Distributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using MLDistributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using ML
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
Predictive Analysis of Financial Fraud Detection using Azure and Spark MLPredictive Analysis of Financial Fraud Detection using Azure and Spark ML
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediFundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
 

More from Jongwook Woo

Machine Learning in Quantum Computing
Machine Learning in Quantum ComputingMachine Learning in Quantum Computing
Machine Learning in Quantum ComputingJongwook Woo
 
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsJongwook Woo
 
Scalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIScalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIJongwook Woo
 
Introduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and PredictionIntroduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and PredictionJongwook Woo
 
Introduction to Big Data and its Trends
Introduction to Big Data and its TrendsIntroduction to Big Data and its Trends
Introduction to Big Data and its TrendsJongwook Woo
 
Rating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and SparkRating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and SparkJongwook Woo
 
History and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep LearningHistory and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep LearningJongwook Woo
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraJongwook Woo
 
Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataJongwook Woo
 
Big Data and Predictive Analysis
Big Data and Predictive AnalysisBig Data and Predictive Analysis
Big Data and Predictive AnalysisJongwook Woo
 
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon SungjaeWhose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon SungjaeJongwook Woo
 
2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in SeoulJongwook Woo
 

More from Jongwook Woo (12)

Machine Learning in Quantum Computing
Machine Learning in Quantum ComputingMachine Learning in Quantum Computing
Machine Learning in Quantum Computing
 
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
 
Scalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIScalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AI
 
Introduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and PredictionIntroduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and Prediction
 
Introduction to Big Data and its Trends
Introduction to Big Data and its TrendsIntroduction to Big Data and its Trends
Introduction to Big Data and its Trends
 
Rating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and SparkRating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and Spark
 
History and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep LearningHistory and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep Learning
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI era
 
Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big Data
 
Big Data and Predictive Analysis
Big Data and Predictive AnalysisBig Data and Predictive Analysis
Big Data and Predictive Analysis
 
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon SungjaeWhose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
 
2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul
 

Recently uploaded

pulse modulation technique (Pulse code modulation).pptx
pulse modulation technique (Pulse code modulation).pptxpulse modulation technique (Pulse code modulation).pptx
pulse modulation technique (Pulse code modulation).pptxNishanth Asmi
 
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...marijomiljkovic1
 
electricity generation from food waste - based bioenergy with IOT.pptx
electricity generation from food waste - based bioenergy with IOT.pptxelectricity generation from food waste - based bioenergy with IOT.pptx
electricity generation from food waste - based bioenergy with IOT.pptxAravindhKarthik1
 
The Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on KubernetesThe Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on KubernetesJacopo Nardiello
 
First Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slideFirst Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slideMonika860882
 
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdf
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdfWave Energy Technologies Overtopping 1 - Tom Thorpe.pdf
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdfErik Friis-Madsen
 
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024California Asphalt Pavement Association
 
Introduction to Data Structures .
Introduction to Data Structures        .Introduction to Data Structures        .
Introduction to Data Structures .Ashutosh Satapathy
 
12. Stairs by U Nyi Hla ngae from Myanmar.pdf
12. Stairs by U Nyi Hla ngae from Myanmar.pdf12. Stairs by U Nyi Hla ngae from Myanmar.pdf
12. Stairs by U Nyi Hla ngae from Myanmar.pdftpo482247
 
PhD summary of Luuk Brederode, presented at 2023-10-17 to Veitch Lister Consu...
PhD summary of Luuk Brederode, presented at 2023-10-17 to Veitch Lister Consu...PhD summary of Luuk Brederode, presented at 2023-10-17 to Veitch Lister Consu...
PhD summary of Luuk Brederode, presented at 2023-10-17 to Veitch Lister Consu...Luuk Brederode
 
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical EngineeringC Sai Kiran
 
Conventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptxConventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptxSAQIB KHURSHEED WANI
 
عناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineeringعناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineeringmennamohamed200y
 
Support nodes for large-span coal storage structures
Support nodes for large-span coal storage structuresSupport nodes for large-span coal storage structures
Support nodes for large-span coal storage structureswendy cai
 
zomato data mining datasets for quality prefernece and conntrol.pptx
zomato data mining  datasets for quality prefernece and conntrol.pptxzomato data mining  datasets for quality prefernece and conntrol.pptx
zomato data mining datasets for quality prefernece and conntrol.pptxPratikMhatre39
 
introduction to python, fundamentals and basics
introduction to python, fundamentals and basicsintroduction to python, fundamentals and basics
introduction to python, fundamentals and basicsKNaveenKumarECE
 
Road to GDSC (Become the next GDSC lead)
Road to GDSC (Become the next GDSC lead)Road to GDSC (Become the next GDSC lead)
Road to GDSC (Become the next GDSC lead)GDSCNiT
 
Evaluation Methods for Social XR Experiences
Evaluation Methods for Social XR ExperiencesEvaluation Methods for Social XR Experiences
Evaluation Methods for Social XR ExperiencesMark Billinghurst
 
Chapter 2 Canal Falls at Mnnit Allahabad .pptx
Chapter 2 Canal Falls at Mnnit Allahabad .pptxChapter 2 Canal Falls at Mnnit Allahabad .pptx
Chapter 2 Canal Falls at Mnnit Allahabad .pptxButcher771
 

Recently uploaded (20)

pulse modulation technique (Pulse code modulation).pptx
pulse modulation technique (Pulse code modulation).pptxpulse modulation technique (Pulse code modulation).pptx
pulse modulation technique (Pulse code modulation).pptx
 
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
 
electricity generation from food waste - based bioenergy with IOT.pptx
electricity generation from food waste - based bioenergy with IOT.pptxelectricity generation from food waste - based bioenergy with IOT.pptx
electricity generation from food waste - based bioenergy with IOT.pptx
 
The Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on KubernetesThe Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on Kubernetes
 
First Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slideFirst Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slide
 
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdf
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdfWave Energy Technologies Overtopping 1 - Tom Thorpe.pdf
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdf
 
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
 
Introduction to Data Structures .
Introduction to Data Structures        .Introduction to Data Structures        .
Introduction to Data Structures .
 
12. Stairs by U Nyi Hla ngae from Myanmar.pdf
12. Stairs by U Nyi Hla ngae from Myanmar.pdf12. Stairs by U Nyi Hla ngae from Myanmar.pdf
12. Stairs by U Nyi Hla ngae from Myanmar.pdf
 
PhD summary of Luuk Brederode, presented at 2023-10-17 to Veitch Lister Consu...
PhD summary of Luuk Brederode, presented at 2023-10-17 to Veitch Lister Consu...PhD summary of Luuk Brederode, presented at 2023-10-17 to Veitch Lister Consu...
PhD summary of Luuk Brederode, presented at 2023-10-17 to Veitch Lister Consu...
 
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
 
Conventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptxConventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptx
 
عناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineeringعناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineering
 
Support nodes for large-span coal storage structures
Support nodes for large-span coal storage structuresSupport nodes for large-span coal storage structures
Support nodes for large-span coal storage structures
 
zomato data mining datasets for quality prefernece and conntrol.pptx
zomato data mining  datasets for quality prefernece and conntrol.pptxzomato data mining  datasets for quality prefernece and conntrol.pptx
zomato data mining datasets for quality prefernece and conntrol.pptx
 
introduction to python, fundamentals and basics
introduction to python, fundamentals and basicsintroduction to python, fundamentals and basics
introduction to python, fundamentals and basics
 
Road to GDSC (Become the next GDSC lead)
Road to GDSC (Become the next GDSC lead)Road to GDSC (Become the next GDSC lead)
Road to GDSC (Become the next GDSC lead)
 
Evaluation Methods for Social XR Experiences
Evaluation Methods for Social XR ExperiencesEvaluation Methods for Social XR Experiences
Evaluation Methods for Social XR Experiences
 
Chapter 2 Canal Falls at Mnnit Allahabad .pptx
Chapter 2 Canal Falls at Mnnit Allahabad .pptxChapter 2 Canal Falls at Mnnit Allahabad .pptx
Chapter 2 Canal Falls at Mnnit Allahabad .pptx
 
Update on the latest research with regard to RAP
Update on the latest research with regard to RAPUpdate on the latest research with regard to RAP
Update on the latest research with regard to RAP
 

Introduction To Big Data and Use Cases using Hadoop

  • 1. Jongwook Woo HiPIC CSULA ENC Lab Hanyang University Seoul, Korea Aug 19th 2014 Jongwook Woo (PhD) High-Performance Information Computing Center (HiPIC) Cloudera Academic Partner and Grants Awardee of Amazon AWS California State University Los Angeles Introduction To Big Data and Use Cases using Hadoop
  • 2. High Performance Information Computing Center Jongwook Woo CSULA Contents 소개  Emerging Big Data Technology  Big Data Use Cases  Hadoop 2.0  Training in Big Data
  • 3. High Performance Information Computing Center Jongwook Woo CSULA Me  이름: 우종욱  직업:  교수 (직책: 부교수), California State University Los Angeles – Capital City of Entertainment  경력:  2002년 부터 교수: Computer Information Systems Dept, College of Business and Economics – www.calstatela.edu/faculty/jwoo5  1998년부터 헐리우드등지의 많은 회사 컨설팅 – 주로 J2EE 미들웨어를 이용한 eBusiness applications 구축 – FAST, Lucene/Solr, Sphinx 검색엔진을 이용한 정보추출, 정보통합 – Warner Bros (Matrix online game), E!, citysearch.com, ARM 등  2009여년 부터 하둡 빅데이타에 관심
  • 4. High Performance Information Computing Center Jongwook Woo CSULA Experience in Big Data  Grants  Received MicroSoft Windows Azure Educator Grant (Oct 2013 - July 2014)  Received Amazon AWS in Education Research Grant (July 2012 - July 2014)  Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011  Partnership  Received Academic Education Partnership with Cloudera since June 2012  Linked with Hortonworks since May 2013 – Positive to provide partnership
  • 5. High Performance Information Computing Center Jongwook Woo CSULA Experience in Big Data  Certificate  Certified Cloudera Instructor  Certified Cloudera Hadoop Developer / Administrator  Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I”, July 8 2012  Certificate of 10gen Training Course, “M101: MongoDB Development”, (Dec 24 2012)  Blog and Github for Hadoop and its ecosystems  http://dal-cloudcomputing.blogspot.com/ – Hadoop, AWS, Cloudera  https://github.com/hipic – Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop  https://github.com/dalgual
  • 6. High Performance Information Computing Center Jongwook Woo CSULA Experience in Big Data  Several publications regarding Hadoop and NoSQL  Deeksha Lakshmi, Iksuk Kim, Jongwook Woo, “Analysis of MovieLens Data Set using Hive”, in Journal of Science and Technology, Dec 2013, Vol3 no12, pp1194-1198, ARPN  “Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data”. Chul Sung, Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck Choe. in Proceedings of the International Joint Conference on Neural Networks, 2013  “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012)  “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”,Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011  “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011)  Collaboration with Universities and companies  USC, Texas A&M, Cloudera, Amazon, MicroSoft
  • 7. High Performance Information Computing Center Jongwook Woo CSULA What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
  • 8. High Performance Information Computing Center Jongwook Woo CSULA Data Google “We don’t have a better algorithm than others but we have more data than others”
  • 9. High Performance Information Computing Center Jongwook Woo CSULA New Data Trend Sparsity Unstructured Schema free data with sparse attributes – Semantic or social relations No relational property – nor complex join queries • Log data Immutable No need to update and delete data
  • 10. High Performance Information Computing Center Jongwook Woo CSULA Data Issues Large-Scale data Tera-Byte (1012), Peta-byte (1015) – Because of web – Sensor Data, Bioinformatics, Social Computing, smart phone, online game… Cannot handle with the legacy approach Too big Un-/Semi-structured data Too expensive Need new systems Non-expensive
  • 11. High Performance Information Computing Center Jongwook Woo CSULA Two Cores in Big Data How to store Big Data How to compute Big Data Google How to store Big Data – GFS – On inexpensive commodity computers How to compute Big Data – MapReduce – Parallel Computing with multiple non-expensive computers • Own super computers
  • 12. High Performance Information Computing Center Jongwook Woo CSULA Hadoop 1.0 Hadoop Doug Cutting – 하둡 창시자 – 아파치 Lucene, Nutch, Avro, 하둡 프로젝트의 창시자 – 아파치 소프트웨어 파운데이션의 보드 멤버 – Chief Architect at Cloudera MapReduce HDFS Restricted Parallel Programming – Not for iterative algorithms – Not for graph
  • 13. High Performance Information Computing Center Jongwook Woo CSULA MapReduce in Detail Functions borrowed from functional programming languages (eg. Lisp) Provides Restricted parallel programming model on Hadoop User implements Map() and Reduce() Libraries (Hadoop) take care of EVERYTHING else –Parallelization –Fault Tolerance –Data Distribution –Load Balancing
  • 14. High Performance Information Computing Center Jongwook Woo CSULA Map Convert input data to (key, value) pairs map() functions run in parallel,  creating different intermediate (key, value) values from different input data sets
  • 15. High Performance Information Computing Center Jongwook Woo CSULA Reduce reduce() combines those intermediate values into one or more final values for that same key reduce() functions also run in parallel, each working on a different output key Bottleneck: reduce phase can’t start until map phase is completely finished.
  • 16. High Performance Information Computing Center Jongwook Woo CSULA Example: Sort URLs in the largest hit order Compute the largest hit URLs Stored in log files Map() Input <logFilename, file text> Output: Parses file and emits <url, hit counts> pairs – eg. <http://hello.com, 1> Reduce() Input: <url, list of hit counts> from multiple map nodes Output: Sums all values for the same key and emits <url, TotalCount> – eg.<http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
  • 17. High Performance Information Computing Center Jongwook Woo CSULA Map/Reduce for URL visits … …Map1() Map2() Mapm() Reduce1 () Reducel() Data Aggregation/Combine (http://hi.com, <1, 1, …, 1>) (http://hello.com, <3, 5, 2, 7>) (http://hi.com, 32) (http://hello.com, 17) Input Log Data Reduce2() (http://hi.com, 1) (http://hello.com, 3) … (http://halo.com, 1) (http://hello.com, 5) … (http://halo.com, <1, 5,>) (http://halo.com, 6)
  • 18. High Performance Information Computing Center Jongwook Woo CSULA© Hortonworks Inc. 2013 - Confidential Hadoop 1 Architecture [1] JobTracker Manage Cluster Resources & Job Scheduling TaskTracker Per-node agent Manage Tasks Page 18
  • 19. High Performance Information Computing Center Jongwook Woo CSULA MapReduce 1.0 Cons and Future Bad for  Fast response time  Large amount of shared data  Fine-grained synch needed  CPU-intensive not data-intensive  Continuous input stream Hadoop 2.0: YARN product
  • 20. High Performance Information Computing Center Jongwook Woo CSULA© Hortonworks Inc. 2013 - Confidential Hadoop 1 Limitations [1] Lacks Support for Alternate Paradigms and Services Force everything needs to look like Map Reduce Iterative applications in MapReduce are 10x slower Scalability Max Cluster size ~5,000 nodes Max concurrent tasks ~40,000 Availability Failure Kills Queued & Running Jobs Hard partition of resources into map and reduce slots Non-optimal Resource Utilization Page 20
  • 21. High Performance Information Computing Center Jongwook Woo CSULA Hadoop as Next-Gen Platform [1] HADOOP 1.0 HDFS (redundant, reliable storage) MapReduce (cluster resource management & data processing) HDFS2 (redundant, highly-available & reliable storage) YARN (cluster resource management) MapReduce (data processing) Others HADOOP 2.0 Single Use System Batch Apps Multi Purpose Platform Batch, Interactive, Online, Streaming, … Page 21
  • 22. High Performance Information Computing Center Jongwook Woo CSULA© Hortonworks Inc. 2013 - Confidential Page 22 Hadoop 2 - YARN Architecture ResourceManager (RM) NodeManager (NM)Per-Node ApplicationMaster (AM) Per-Application – Manages application lifecycle and task scheduling Resource Manager MapReduce Status Job Submission Client Node Manager Node Manager Container Node Manager App Mstr Node Status Resource Request
  • 23. High Performance Information Computing Center Jongwook Woo CSULA Hadoop 2.0: YARN [2]  Data processing applications and services  Online Serving – HOYA (HBase on YARN)  Real-time event processing – Storm, S4, other commercial platforms  Tez – Generic framework to run a complex DAG  MPI: OpenMPI, MPICH2  Master-Worker  Machine Learning: Spark  Graph processing: Giraph  Impala: Interactive SQL  Enabled by allowing the use of paradigm-specific application master
  • 24. High Performance Information Computing Center Jongwook Woo CSULA Josh Wills (Cloudera)  “I have found that many kinds of scientists– such as astronomers, geneticists, and geophysicists– are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn.”  “Data science is a set of techniques used by many scientists to solve problems across a wide array of scientific fields.”
  • 25. High Performance Information Computing Center Jongwook Woo CSULA Legacy Example In late 2007, the New York Times wanted to make available over the web its entire archive of articles, 11 million in all, dating back to 1851. four-terabyte pile of images in TIFF format. needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files. – not a particularly complicated but large computing chore, • requiring a whole lot of computer processing time.
  • 26. High Performance Information Computing Center Jongwook Woo CSULA Legacy Example (Cont’d) In late 2007, the New York Times wanted to make available over the web its entire archive of articles, a software programmer at the Times, Derek Gottfrid, – playing around with Amazon Web Services, Elastic Compute Cloud (EC2), • uploaded the four terabytes of TIFF data into Amazon's Simple Storage System (S3) • In less than 24 hours, 11 millions PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site.  The total cost for the computing job? $240 – 10 cents per computer-hour times 100 computers times 24 hours
  • 27. High Performance Information Computing Center Jongwook Woo CSULA HuffPost | AOL Two Machine Learning Use Cases Comment Moderation  Evaluate All New HuffPost User Comments Every Day  Identify Abusive / Aggressive Comments  Auto Delete / Publish ~25% Comments Every Day Article Classification  Tag Articles for Advertising  E.g.: scary, salacious, …
  • 28. High Performance Information Computing Center Jongwook Woo CSULA Use Cases experienced Log Analysis  Log files from IPS and IDS – 1.5GB per day for each systems  Extracting unusual cases using Hadoop, Solr, Flume on Cloudera Customer Behavior Analysis Market Basket Analysis Algorithm  Machine Learning for Image Processing with Texas A&M Hadoop Streaming API  Movie Data Analysis  Hive, Impala
  • 29. High Performance Information Computing Center Jongwook Woo CSULA Use Cases: Chip Design and Seminconductor Intel Using Hadoop – to gather historical information during manufacturing • and combine new sources of information that had previously been too unmanageable to use – a small team of five people was able • to slash $3 million off the cost of testing just one line of Intel Core processors in 2012. Intel IT expects to realize an additional $30 million in cost avoidance in 2013-2014.
  • 30. High Performance Information Computing Center Jongwook Woo CSULA Use Cases: Chip Design and Seminconductor AMD Using a Hadoop implementation, able to cut the work of employees checking semiconductor wafer quality data by 90 percent – by catching faulty product batches earlier in production. Samsung use of a Hadoop file system for handling its data warehouse – made analytics processing 10 times faster on 75 percent as much computing power, • even as data sets grew 10 times larger
  • 31. High Performance Information Computing Center Jongwook Woo CSULA How for Hadoop and Ecosystems Need R&D by University University should own Hadoop Cluster – It is an inexpensive super computer Possible to have – Big Data R&D, training, RFP We may predict many – Semiconductor Data Analysis – System Chip Design Data Analysis
  • 32. High Performance Information Computing Center Jongwook Woo CSULA How for Hadoop and Ecosystems Training program Self-study – Takes time: more than a year to be an expert – Don’t know the detail – Miss many important topics Cloudera – $2,000, Hands-on Exercises – About Hadoop, Hbase, Hive/Pig, Data Analysis, Spark, Data Mining etc • 하둡 개발자 • 하둡 시스템관리자 • 하둡 데이터 분석가/과학자 • 하둡 Spark
  • 33. High Performance Information Computing Center Jongwook Woo CSULA How for Hadoop and Ecosystems Training program (Cont’d) Educational Partnership with Cloudera – One of initiators to launch Cloudera Academic Programs • Teach Big Data at CSULA – http://www.cloudera.com/content/cloudera/en/our- customers/csula-academic-partnership.html – Training ppl at Samsung, Other Small companies in Korea using Cloudera’s material
  • 34. High Performance Information Computing Center Jongwook Woo CSULA Conclusion Era of Big Data Need to store and compute Big Data Many solutions but Hadoop Hadoop is supercomputer that you can own Hadoop 2.0 Blue Ocean for Publications, RFP Training is important
  • 35. High Performance Information Computing Center Jongwook Woo CSULA Question?
  • 36. High Performance Information Computing Center Jongwook Woo CSULA References 1. YARN Apache Hadoop Next Generation Compute Platform, Bikas Saha 2. Apache Hadoop Yarn Enabling Next Generation 1. http://www.slideshare.net/hortonworks/apache- hadoop-yarn-enabling-nex