NYC Data Science Academy
Hadoop Application Development with Real Cases
Multi-layer Model
Data Pyramid and Roles
 Business personnel
 ETL engineer
 Data warehouse engineer
 Analyst
 Data visualization engineer
 IT support: operations and maintenance, programmers
Data Analysis
 Analyze collected data with statistical methods for a defined purpose, then interpret the results and put them into practice
Data Mining
 Data mining is a technique for retrieving information hidden in data. It is a process that applies knowledge-discovery algorithms to large databases and presents the associations it finds to users.
 Origins: hypothesis testing, pattern recognition, artificial intelligence, machine learning
 Common data mining tasks: association rules, clustering, outlier analysis
 Case: beer and diapers
 Science: Detecting Novel Associations in Large Data Sets
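The beer-and-diapers case is usually told in terms of association-rule measures. The sketch below is only an illustration: the baskets are invented, and the helper names `support`, `confidence` and `lift` are assumptions, not part of the course material. It assumes the antecedent appears in at least one basket.

```python
# Hypothetical shopping baskets; items and counts are made up for illustration.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) = support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how often the consequent appears on its own."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

# Rule "diapers -> beer" on the toy data.
rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
rule_lift = lift({"diapers"}, {"beer"}, baskets)
```

A lift above 1 suggests the two items appear together more often than they would if they were unrelated.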
Business Intelligence
 BI = data warehouses (storage) + data analysis and data mining (analysis) + reports (presentation)
 Our course
Data Analysis Algorithms
 Popular Algorithms
Regression
Time Series Analysis
Classifier
Clustering
Association Rules
Data Analysis
 Data Analysis Tools
Popular Data Analysis Tools Ranking
Data Analysis Stages
 Stage 1: dominated by business personnel
 Stage 2: dominated jointly by business personnel and analysts
 Stage 3: dominated by analysts
Data Analysis in Stage 1
 Business staff set all the requirements and most of the analysis plans
 Drawing on experience, business staff select features and set thresholds; IT staff search for and integrate the data; analysts produce the reports
 Feature selection and the choice of thresholds rest on experience and personal knowledge
 Suitable for simple cases; the analysis technique is equivalent to the simplest decision tree
 Business staff have valuable experience and are hard to replace; analysts only produce charts and are easily replaced
 This is common in traditional industries
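To see why a hand-set threshold amounts to the simplest decision tree, consider this minimal sketch. The customer records, the `monthly_spend` feature and the threshold value are all hypothetical; in stage 1 they would come from business staff rather than from a fitted model.

```python
# Hypothetical customer records; field names and values are invented for illustration.
customers = [
    {"id": 1, "monthly_spend": 120.0},
    {"id": 2, "monthly_spend": 35.5},
    {"id": 3, "monthly_spend": 80.0},
]

# Threshold chosen from experience, not learned from data (assumed value).
SPEND_THRESHOLD = 80.0

def label(customer, threshold=SPEND_THRESHOLD):
    """One-split decision tree: a single feature compared against a single threshold."""
    return "high_value" if customer["monthly_spend"] >= threshold else "regular"

# The "report" an analyst would then chart.
report = {c["id"]: label(c) for c in customers}
```

All the insight lives in the choice of feature and threshold, which is why the business staff are the ones who are hard to replace at this stage.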
Data Analysis in Stage 2
 More complex. Business staff can analyze a small number of data records but cannot work out all the features and the relationships among them. They have no experience with large numbers of samples.
 Analysts step in to clean the data, select features, and finally build a suitable model to solve the problem.
 Business staff and analysts can evaluate the results together, so the work is very likely to succeed. Analysts prefer this stage because it confirms their ability and value.
Spammer in WordPress
Data Analysis in Stage 3
 Business staff have no experience with the case and cannot offer any useful prior knowledge
 Analysts use various tools and models to mine the data, hoping to make an interesting discovery
 This is the analyst's ideal world, but it is likely to fail
 Business staff cannot get involved, and they dislike this stage
Step Forward
 Stage 1 (gold on the ground) -> Stage 2 (gold beneath the ground) -> Stage 3 (gold deeply buried)
 If analysts are reckless, business staff will be reluctant to help
 Data analysis is rooted in the business background. The goal of analysis is to increase profit. Successful analysis cannot be separated from the business
 An interesting topic is more important than the model
What is Big Data
Features of Big Data
Challenges for Analysts
 Insertion and queries both become bottlenecks as the amount of data grows
 The trend toward integrating analysis results into user-facing applications demands faster real-time computation and shorter response times
 More complex models require more expensive computation
Dilemma of Traditional Data Analysis Tools
 R, SAS and SPSS are experimental tools
 The data size they can handle is restricted by memory size
 An Oracle database can hold a large volume of data, but lacks specialized, fast analysis capability
 Sampling is a limited solution; it is not useful for clustering or recommendation systems
 Solution: a Hadoop cluster and Map-Reduce parallel computing
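The way out of the memory limit is to split the data into blocks, summarize each block where it lives, and merge only the small summaries. The sketch below illustrates that idea on one machine with a mean. The function names and block size are assumptions for illustration, it assumes a non-empty input, and it runs the blocks sequentially rather than on a real cluster.

```python
from functools import reduce

def chunks(values, size):
    """Yield successive slices, standing in for data blocks stored on different nodes."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

def partial_stats(block):
    """Map step: summarize one block as (sum, count); only this small pair leaves the node."""
    return (sum(block), len(block))

def merge(a, b):
    """Reduce step: combine two partial summaries."""
    return (a[0] + b[0], a[1] + b[1])

def distributed_mean(values, block_size):
    """Mean computed from per-block summaries instead of one pass over all the data."""
    total, count = reduce(merge, map(partial_stats, chunks(values, block_size)))
    return total / count

data = list(range(1, 101))
mean_small_blocks = distributed_mean(data, 7)
mean_one_block = distributed_mean(data, 100)
```

The result does not depend on the block size, which is the property that lets the same computation be spread over many nodes.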
Case 1: Analysis and monitoring for a telecommunications company
 Configuration of the original database server: HP minicomputer, 128 GB memory, 48-core CPU, RAC with two nodes, one node for insertion and the other for queries
 Storage: HP virtual storage, over 1000 disks
 Architecture: Oracle RAC with two nodes
 Bottlenecks: 1. Insertion 2. Query
Case 2: DNA database
Case 3: Social analysis, activity
fingerprint detection

Public voice mail intersect
User          IMSI 1   IMSI 2   ……   IMSI n   Total call duration
User A IMSI   20%      12%      ……   5%       365
User B IMSI   15%      13%      ……   2%       310

Public SMS intersect
User          IMSI 1   IMSI 2   ……   IMSI n   Monthly SMS count
User A IMSI   50%      10%      ……   5%       200
User B IMSI   20%      13%      ……   2%       260

Public base station
User          CGI 1    CGI 2    ……   CGI n    Shutdown
User A IMSI   20%      12%      ……   5%       20%
User B IMSI   15%      13%      ……   2%       5%

Public fingerprint (eigenvector)
Source        User A                       User B
Voice         (0.2, 0.12, …, 0.05)         (0.15, 0.13, …, 0.02)
SMS           (0.5, 0.1, …, 0.05)          (0.2, 0.13, …, 0.02)
Base station  (0.2, 0.12, …, 0.05, 0.2)    (0.15, 0.13, …, 0.02, 0.05)

When the angle θ between two fingerprint vectors equals 90°, the two vectors are independent
When θ equals 0, the two vectors are perfectly dependent
The closer θ is to 0, the more dependent the vectors are
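The comparison of two fingerprints can be sketched with the standard cosine formula, cos θ = (u · v) / (|u| |v|). The slide's own formula did not survive in the transcript, so the symbol θ, the function name `angle_degrees` and the shortened sample vectors below are assumptions for illustration. Both vectors are assumed to be non-zero and of equal length.

```python
import math

def angle_degrees(u, v):
    """Angle between two equal-length, non-zero vectors, via the cosine formula."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))

# Shortened, made-up fingerprints (the slide's vectors are elided with "…").
card_1 = [0.20, 0.12, 0.05]
card_2 = [0.15, 0.13, 0.02]
card_3 = [0.00, 0.00, 0.90]

similar = angle_degrees(card_1, card_2)    # small angle: likely the same user
different = angle_degrees(card_1, card_3)  # large angle: unrelated behavior
```

Two SIM cards whose fingerprints form a small angle are candidates for belonging to the same person. A cut-off angle would have to be chosen with business input.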
Case 3: Social analysis, VIP detection
Solution that analysts look forward to
 Eliminates the bottleneck for the foreseeable future
 Carries over existing techniques smoothly, for example SQL and R
 Cost of the new platform: hardware and software, redevelopment, skills training, maintenance
Path to Big Data
Idea of Hadoop
Map-Reduce Programming
Map-Reduce program for meteorological
data analysis
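The slide's own program is not part of this transcript. As a stand-in, here is a minimal single-process sketch of the three Map-Reduce phases applied to weather records, finding the maximum temperature per year. The "year,temperature" input format and the function names are assumptions. A real Hadoop job would typically be written against the Java API and would parse a richer record format.

```python
from collections import defaultdict

# Hypothetical input lines in the form "year,temperature".
records = ["1950,0", "1950,22", "1950,-11", "1949,111", "1949,78"]

def map_phase(line):
    """Emit one (year, temperature) pair per input line."""
    year, temp = line.split(",")
    yield year, int(temp)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(year, temps):
    """Collapse all temperatures of one year to their maximum."""
    return year, max(temps)

mapped = (pair for line in records for pair in map_phase(line))
result = dict(reduce_phase(year, temps) for year, temps in shuffle(mapped).items())
```

Only `map_phase` and `reduce_phase` would be written by the programmer. Grouping by key is the framework's job.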
Map-Reduce implementation for popular
algorithms
Why not Hadoop?
 Java?
 Hard to control?
 Hard to integrate data?
 Hadoop vs Oracle
Analysis under the Hadoop system
 Mainstream: Java programs
 Lightweight scripting language: Pig
 Smooth migration from SQL: Hive
 NoSQL: HBase
Family of Hadoop
Pig
 Pig can be treated as client software for Hadoop: it connects to Hadoop and runs the analysis
 Pig is convenient for users unfamiliar with Java; it uses a SQL-like language, Pig Latin, to describe data flows
 Pig Latin can sort, filter, sum, group and join data, and supports user-defined functions. It is a lightweight scripting language for data operations and analysis
 Pig can be treated as a mapping from Pig Latin to Map-Reduce
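A data flow of the kind described above can be pictured as a chain of small steps. The Python sketch below is only an analogy, not Pig itself: each step is annotated with the Pig Latin operator it loosely corresponds to, and the call records are invented for illustration.

```python
from collections import defaultdict

# Hypothetical call records: (user, call_type, seconds).
calls = [
    ("alice", "voice", 120),
    ("bob", "voice", 30),
    ("alice", "sms", 0),
    ("alice", "voice", 60),
    ("bob", "voice", 45),
]

# Step 1, like a Pig Latin FILTER: keep only voice calls.
voice = [c for c in calls if c[1] == "voice"]

# Step 2, like GROUP ... BY: collect durations per user.
by_user = defaultdict(list)
for user, _, seconds in voice:
    by_user[user].append(seconds)

# Step 3, like FOREACH ... GENERATE with SUM: one total per group.
totals = [(user, sum(secs)) for user, secs in by_user.items()]

# Step 4, like ORDER ... BY ... DESC: sort by total duration.
ranked = sorted(totals, key=lambda t: t[1], reverse=True)
```

In Pig, each such step is a single statement, and the translation into Map-Reduce jobs is left to Pig.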
Hive
 A data warehouse tool that can turn the primary data structures in Hadoop into tables in Hive
 Supports HiveQL, a language almost the same as SQL; its functions are the same as SQL's except updating, indexing and
 Hive can be treated as a mapping from SQL to Map-Reduce
 Offers shell, JDBC/ODBC, Thrift and Web interfaces
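To make "mapping from SQL to Map-Reduce" concrete, the sketch below shows how an equi-join between two tables on a shared `user_id` column can be expressed as map, shuffle and reduce steps (a reduce-side join). It is a conceptual illustration under invented table and column names, not Hive's actual execution plan.

```python
from collections import defaultdict

# Hypothetical tables; names and columns are invented for illustration.
users = [(1, "alice"), (2, "bob")]               # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "lamp")]  # (user_id, item)

def map_users(row):
    """Emit the join key with the value tagged by its source table."""
    user_id, name = row
    return user_id, ("U", name)

def map_orders(row):
    user_id, item = row
    return user_id, ("O", item)

def reduce_join(user_id, tagged_values):
    """Pair every user value with every order value that shares the key."""
    names = [v for tag, v in tagged_values if tag == "U"]
    items = [v for tag, v in tagged_values if tag == "O"]
    return [(user_id, name, item) for name in names for item in items]

# Shuffle: group tagged values by join key.
groups = defaultdict(list)
for key, value in [map_users(r) for r in users] + [map_orders(r) for r in orders]:
    groups[key].append(value)

joined = [row for key in sorted(groups) for row in reduce_join(key, groups[key])]
```

The SQL user writes one JOIN clause. The tagging, grouping and pairing are the kind of work a SQL-on-Hadoop layer generates on the user's behalf.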
Features of Mahout
 Mahout provides scalable machine learning algorithms (Map-Reduce implementations), but the Hadoop platform is not required: the core library also has efficient single-machine algorithms
 Mature and popular algorithms are
1. Frequent Itemset Mining
2. Clustering
3. Classification
4. Recommendation Systems
5. Frequent Subgraph Mining
Reference Textbooks
Typical Experiment Environment (with server)
 Server: ESXi, able to host multiple virtual machines and run 3 of them at the same time
 PC: Linux or Windows + Cygwin; Linux can be standalone or a virtual machine
 SSH: use the ssh command under Linux, or SecureCRT or PuTTY under Windows, to connect to the remote Linux server
 VMware client: management of ESXi
 Hadoop: use version 1.x or 2.x
Typical Experiment Environment (with only a PC or laptop running Windows)
 At least 4 GB of memory; 64-bit Windows is preferred, because a 32-bit system can use only a little more than 3 GB of memory
 Install VMware Workstation or VirtualBox
 Deploy 3 virtual machines and run them at the same time. If only two VMs can run, treat the host as a node (via Cygwin) and use bridged networking for the virtual network
 Install Linux and Java
 Older computers can fall back to a pseudo-distributed environment
Experiment Environment
 Deploy Pig
 Deploy Hive
 Deploy Mahout
List of Cases of the Course
 Analysis of a high-volume website log system; retrieving KPI data (Map-Reduce)
 LBS application for a telecommunications company; analysis of users' mobile phone traces (Map-Reduce)
 User analysis for a telecommunications company; labeling duplicate users by the fingerprint of their calls (Map-Reduce)
 Recommendation system for an e-commerce company (Map-Reduce)
 Complex recommendation system application (Mahout)
 Social network; distance between users; community detection (Pig)
 Importance of nodes in a social network (Map-Reduce)
 Application of clustering algorithms; analysis of VIPs (Map-Reduce, Mahout)
 Financial data analysis; retrieving reverse repurchase information from historical data (Hive)
 Setting stock strategies with data analysis (Map-Reduce, Hive)
 GPS application; sign-in data analysis (Pig)
 Implementation and optimization of sorting on Map-Reduce
 Middleware development; cooperation of multiple Hadoop clusters

More Related Content

What's hot

MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB
 
Data Science at Speed. At Scale.
Data Science at Speed. At Scale.Data Science at Speed. At Scale.
Data Science at Speed. At Scale.DataWorks Summit
 
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address Requirements
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address RequirementsGov & Private Sector Regulatory Compliance: Using Hadoop to Address Requirements
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address RequirementsDataWorks Summit
 
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Precisely
 
Ibm machine learning for z os
Ibm machine learning for z osIbm machine learning for z os
Ibm machine learning for z osCuneyt Goksu
 
Privacy-Preserving AI Network - PlatON 2.0
Privacy-Preserving AI Network - PlatON 2.0 Privacy-Preserving AI Network - PlatON 2.0
Privacy-Preserving AI Network - PlatON 2.0 ShiHeng1
 
Driven by data - Why we need a Modern Enterprise Data Analytics Platform
Driven by data - Why we need a Modern Enterprise Data Analytics PlatformDriven by data - Why we need a Modern Enterprise Data Analytics Platform
Driven by data - Why we need a Modern Enterprise Data Analytics PlatformArne Roßmann
 
Monitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service ProvidersMonitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service ProvidersDataWorks Summit
 
Operational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresOperational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresDATAVERSITY
 
IBM Big Data Analytics Concepts and Use Cases
IBM Big Data Analytics Concepts and Use CasesIBM Big Data Analytics Concepts and Use Cases
IBM Big Data Analytics Concepts and Use CasesTony Pearson
 
Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes John Archer
 
Pouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy IndustryPouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy IndustryDataWorks Summit
 
Modern Data Architecture
Modern Data Architecture Modern Data Architecture
Modern Data Architecture Mark Hewitt
 
Machine Learning Everywhere
Machine Learning EverywhereMachine Learning Everywhere
Machine Learning EverywhereDataWorks Summit
 
Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...DataWorks Summit
 
Deutsche Telekom on Big Data
Deutsche Telekom on Big DataDeutsche Telekom on Big Data
Deutsche Telekom on Big DataDataWorks Summit
 
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXTDriving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXTDataWorks Summit
 
Future of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native worldFuture of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native worldSrivatsan Srinivasan
 

What's hot (20)

MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
 
Data Science at Speed. At Scale.
Data Science at Speed. At Scale.Data Science at Speed. At Scale.
Data Science at Speed. At Scale.
 
Destroying Data Silos
Destroying Data SilosDestroying Data Silos
Destroying Data Silos
 
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address Requirements
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address RequirementsGov & Private Sector Regulatory Compliance: Using Hadoop to Address Requirements
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address Requirements
 
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
 
Ibm machine learning for z os
Ibm machine learning for z osIbm machine learning for z os
Ibm machine learning for z os
 
Privacy-Preserving AI Network - PlatON 2.0
Privacy-Preserving AI Network - PlatON 2.0 Privacy-Preserving AI Network - PlatON 2.0
Privacy-Preserving AI Network - PlatON 2.0
 
Driven by data - Why we need a Modern Enterprise Data Analytics Platform
Driven by data - Why we need a Modern Enterprise Data Analytics PlatformDriven by data - Why we need a Modern Enterprise Data Analytics Platform
Driven by data - Why we need a Modern Enterprise Data Analytics Platform
 
Monitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service ProvidersMonitizing Big Data at Telecom Service Providers
Monitizing Big Data at Telecom Service Providers
 
Operational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresOperational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data Stores
 
OpenPOWER Update
OpenPOWER UpdateOpenPOWER Update
OpenPOWER Update
 
IBM Big Data Analytics Concepts and Use Cases
IBM Big Data Analytics Concepts and Use CasesIBM Big Data Analytics Concepts and Use Cases
IBM Big Data Analytics Concepts and Use Cases
 
Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes
 
Pouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy IndustryPouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy Industry
 
Modern Data Architecture
Modern Data Architecture Modern Data Architecture
Modern Data Architecture
 
Machine Learning Everywhere
Machine Learning EverywhereMachine Learning Everywhere
Machine Learning Everywhere
 
Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...
 
Deutsche Telekom on Big Data
Deutsche Telekom on Big DataDeutsche Telekom on Big Data
Deutsche Telekom on Big Data
 
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXTDriving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
 
Future of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native worldFuture of Data Platform in Cloud Native world
Future of Data Platform in Cloud Native world
 

Viewers also liked

Learn from HomeAway Hadoop Development and Operations Best Practices
Learn from HomeAway Hadoop Development and Operations Best PracticesLearn from HomeAway Hadoop Development and Operations Best Practices
Learn from HomeAway Hadoop Development and Operations Best PracticesDriven Inc.
 
Ebook 5 steps to speak a new language
Ebook 5 steps to speak a new languageEbook 5 steps to speak a new language
Ebook 5 steps to speak a new languageHien Vu
 
Présentation Igloo - Forum Mind&Market 2016
Présentation Igloo - Forum Mind&Market 2016Présentation Igloo - Forum Mind&Market 2016
Présentation Igloo - Forum Mind&Market 2016Vincent Huwer
 
Vortrag Anlagensubstanzbewertung zur Konferenz Smart Maintenance 2015
Vortrag Anlagensubstanzbewertung zur Konferenz Smart Maintenance 2015Vortrag Anlagensubstanzbewertung zur Konferenz Smart Maintenance 2015
Vortrag Anlagensubstanzbewertung zur Konferenz Smart Maintenance 2015MCP Deutschland GmbH
 
Informativa bluemoz italiano web
Informativa bluemoz italiano webInformativa bluemoz italiano web
Informativa bluemoz italiano webLeonardo Zalateu
 
PPT Del Paseo Museo Interactivo Mirador
PPT Del Paseo Museo Interactivo MiradorPPT Del Paseo Museo Interactivo Mirador
PPT Del Paseo Museo Interactivo MiradorGabriela Lepe
 
evaluación distancia grado historia del arte_arte prehistórico
evaluación distancia grado historia del arte_arte prehistóricoevaluación distancia grado historia del arte_arte prehistórico
evaluación distancia grado historia del arte_arte prehistóricoarquitectura rosamariagal
 
MBC Group - Magnolia in the Media
MBC Group - Magnolia in the MediaMBC Group - Magnolia in the Media
MBC Group - Magnolia in the Mediabkraft
 
Phase transformation and volume collapse of sm bi under high pressure
Phase transformation and volume collapse of sm bi under high pressurePhase transformation and volume collapse of sm bi under high pressure
Phase transformation and volume collapse of sm bi under high pressureAlexander Decker
 
2014 10 zoomsquare fh technikum excerpt
2014 10 zoomsquare fh technikum excerpt2014 10 zoomsquare fh technikum excerpt
2014 10 zoomsquare fh technikum excerptzoomsquare
 
Induction session | AIESEC BABEZ
Induction session | AIESEC BABEZInduction session | AIESEC BABEZ
Induction session | AIESEC BABEZTinhinaneAH
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talksyhadoop
 
S2 1 Intro Anva
S2 1 Intro AnvaS2 1 Intro Anva
S2 1 Intro Anvataecoep
 

Viewers also liked (20)

Learn from HomeAway Hadoop Development and Operations Best Practices
Learn from HomeAway Hadoop Development and Operations Best PracticesLearn from HomeAway Hadoop Development and Operations Best Practices
Learn from HomeAway Hadoop Development and Operations Best Practices
 
Ebook 5 steps to speak a new language
Ebook 5 steps to speak a new languageEbook 5 steps to speak a new language
Ebook 5 steps to speak a new language
 
Présentation Igloo - Forum Mind&Market 2016
Présentation Igloo - Forum Mind&Market 2016Présentation Igloo - Forum Mind&Market 2016
Présentation Igloo - Forum Mind&Market 2016
 
Vortrag Anlagensubstanzbewertung zur Konferenz Smart Maintenance 2015
Vortrag Anlagensubstanzbewertung zur Konferenz Smart Maintenance 2015Vortrag Anlagensubstanzbewertung zur Konferenz Smart Maintenance 2015
Vortrag Anlagensubstanzbewertung zur Konferenz Smart Maintenance 2015
 
Siscon Corporate Document
Siscon Corporate DocumentSiscon Corporate Document
Siscon Corporate Document
 
Informativa bluemoz italiano web
Informativa bluemoz italiano webInformativa bluemoz italiano web
Informativa bluemoz italiano web
 
Whopper lust
Whopper lustWhopper lust
Whopper lust
 
PPT Del Paseo Museo Interactivo Mirador
PPT Del Paseo Museo Interactivo MiradorPPT Del Paseo Museo Interactivo Mirador
PPT Del Paseo Museo Interactivo Mirador
 
evaluación distancia grado historia del arte_arte prehistórico
evaluación distancia grado historia del arte_arte prehistóricoevaluación distancia grado historia del arte_arte prehistórico
evaluación distancia grado historia del arte_arte prehistórico
 
MBC Group - Magnolia in the Media
MBC Group - Magnolia in the MediaMBC Group - Magnolia in the Media
MBC Group - Magnolia in the Media
 
Phase transformation and volume collapse of sm bi under high pressure
Phase transformation and volume collapse of sm bi under high pressurePhase transformation and volume collapse of sm bi under high pressure
Phase transformation and volume collapse of sm bi under high pressure
 
2014 10 zoomsquare fh technikum excerpt
2014 10 zoomsquare fh technikum excerpt2014 10 zoomsquare fh technikum excerpt
2014 10 zoomsquare fh technikum excerpt
 
Induction session | AIESEC BABEZ
Induction session | AIESEC BABEZInduction session | AIESEC BABEZ
Induction session | AIESEC BABEZ
 
Audio engineering timeline
Audio engineering timelineAudio engineering timeline
Audio engineering timeline
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
Salud enfermedad
Salud enfermedadSalud enfermedad
Salud enfermedad
 
Geografia de grecia bach
Geografia de grecia bachGeografia de grecia bach
Geografia de grecia bach
 
S2 1 Intro Anva
S2 1 Intro AnvaS2 1 Intro Anva
S2 1 Intro Anva
 
LCCS Charity Golf & Gala Dinner
LCCS Charity Golf & Gala DinnerLCCS Charity Golf & Gala Dinner
LCCS Charity Golf & Gala Dinner
 
El Rubius
El RubiusEl Rubius
El Rubius
 

Similar to Hadoop dev 01

[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big DataInfochimps, a CSC Big Data Business
 
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfThe Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfData Science Council of America
 
Accelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsAccelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsArcadia Data
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Tomasz Bednarz
 
Ch1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxCh1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxAbderrahmanABID2
 
2019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 42019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 4Ferdin Joe John Joseph PhD
 
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...Alex Liu
 
How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists CCG
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
 
How to make your data scientists happy
How to make your data scientists happy How to make your data scientists happy
How to make your data scientists happy Hussain Sultan
 
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfOSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfAltinity Ltd
 
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...phdAssistance1
 
Big Data in Action – Real-World Solution Showcase
 Big Data in Action – Real-World Solution Showcase Big Data in Action – Real-World Solution Showcase
Big Data in Action – Real-World Solution ShowcaseInside Analysis
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceFerdin Joe John Joseph PhD
 
Big Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavBig Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavSwapnil (Neil) Jadhav
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformSanjay Padhi, Ph.D
 
How to Become a Big Data Professional.pdf
How to Become a Big Data Professional.pdfHow to Become a Big Data Professional.pdf
How to Become a Big Data Professional.pdfCareervira
 

Hadoop dev 01

  • 1. NYC Data Science Academy: Hadoop Application Development with Real Cases
  • 2. Multi-layer Model
  • 3. Data Pyramid and Roles
      • Business personnel
      • ETL engineer
      • Data warehouse engineer
      • Analyst
      • Data visualization engineer
      • IT support: operations and maintenance, programmers
  • 4. Data Analysis
      • Analyze collected data with statistical methods for a defined purpose, then interpret the results and act on them
  • 5. Data Mining
      • Data mining is a technique for retrieving hidden information from data. It applies knowledge-discovery algorithms to large databases and presents the associations found to users.
      • Origins: hypothesis testing, pattern recognition, artificial intelligence, machine learning
      • Common data mining projects: association rules, clustering, outlier analysis
      • Case: beer and diapers
      • Science: Detecting Novel Associations in Large Data Sets
  • 6. Business Intelligence
      • BI = data warehouses (storage) + data analysis and data mining (analysis) + reports (presentation)
      • Our course
  • 7. Data Analysis Algorithms
      • Popular algorithms
  • 8. Regression
  • 9. Time Series Analysis
  • 10. Classifier
  • 11. Clustering
  • 12. Association Rules
  • 13. Data Analysis
      • Data analysis tools
  • 14. Popular Data Analysis Tools Ranking
  • 15. Data Analysis Stages
      • Stage 1: dominated by business personnel
      • Stage 2: dominated by both business personnel and analysts
      • Stage 3: dominated by analysts
  • 16. Data Analysis in Stage 1
      • Business staff set all the requirements and most of the analysis plans
      • Based on experience, business staff select features and set thresholds; IT staff search for and integrate the data; analysts produce the reports
      • Feature selection and the choice of thresholds rest on experience and personal knowledge
      • Suitable for simple cases; the analysis technique is equivalent to the simplest decision tree
      • Business staff have valuable experience and are hard to replace; analysts only produce graphs and are easily replaced
      • This is common in traditional industries
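The remark that stage-1 analysis is equivalent to the simplest decision tree can be illustrated with a minimal sketch: one feature and one hand-picked threshold form a one-level tree (a decision stump). The feature name `monthly_spend`, the threshold of 100, the labels, and the sample records are all hypothetical and do not come from the slides.

```python
def flag_customer(monthly_spend, threshold=100.0):
    """One-level decision tree (a stump): one feature, one expert-chosen threshold.

    The threshold stands in for a value a business expert would pick from
    experience; it is an assumption made for this illustration.
    """
    return "high value" if monthly_spend >= threshold else "regular"


# Hypothetical records: (customer name, monthly spend)
records = [("alice", 250.0), ("bob", 40.0), ("carol", 100.0)]

# The "report" an analyst would produce from the expert's rule
report = {name: flag_customer(spend) for name, spend in records}
print(report)
```

Nothing is learned from the data here: the rule encodes the expert's judgment directly, which is why this stage depends so heavily on business experience.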
  • 17. Data Analysis in Stage 2
      • More complex. Business staff can analyze a small number of records but cannot identify all the features or the relationships among them. They have no experience with large samples.
      • Analysts clean the data, select features, and build a suitable model to solve the problem.
      • Business staff and analysts evaluate the results together, so the work is very likely to succeed. Analysts prefer this stage because it confirms their ability and value.
  • 18. Spammer in WordPress
  • 19. Data Analysis in Stage 3
      • Business staff have no experience with the case and cannot offer any useful prior knowledge
      • Analysts use various tools and models to mine the data, hoping for interesting discoveries
      • This is the analyst's ideal world, but it is likely to fail
      • Business staff cannot get involved, and they dislike this stage
  • 20. Step Forward
      • Stage 1 (gold on the ground) -> stage 2 (gold beneath the ground) -> stage 3 (gold deeply buried)
      • If analysts are reckless, business staff will be reluctant to help
      • Data analysis is rooted in the business background. The goal of analysis is to increase profit, so successful analysis cannot be separated from the business
      • An interesting topic is more important than the model
  • 21. What is Big Data
  • 22. Features of Big Data
  • 23. Challenges for Analysts
      • Growing data volume creates bottlenecks for both insertion and queries
      • The trend of integrating analysis results into users' applications demands faster real-time computation and shorter response times
      • More complex models require more expensive computation
  • 24. Dilemma of Traditional Data Analysis Tools
      • R, SAS, and SPSS are experimental tools
      • The data size they can handle is limited by memory size
      • An Oracle database can hold large volumes of data but lacks specialized, fast analysis capability
      • Sampling is a limited solution; it is not useful for clustering or recommendation systems
      • Solution: a Hadoop cluster and Map-Reduce parallel computing
  • 25. Case 1: Analysis and monitoring for a telecommunications company
  • 26. Case 1: Analysis and monitoring for a telecommunications company
      • Configuration of the original database server: HP minicomputer, 128 GB memory, 48-core CPU, RAC with two nodes (one for insertion, the other for queries)
      • Storage: HP virtual storage, over 1000 disks
      • Architecture: Oracle RAC with two nodes
      • Bottlenecks: 1. Insertion 2. Query
  • 27. Case 2: DNA database
  • 28. Case 3: Social analysis, activity fingerprint detection

      Public voice mail intersect | IMSI 1 | IMSI 2 | …… | IMSI n | Total call duration | Fingerprint (eigenvector)
      User A IMSI                 | 20%    | 12%    | …… | 5%     | 365                 | (0.2, 0.12, …, 0.05)
      User B IMSI                 | 15%    | 13%    | …… | 2%     | 310                 | (0.15, 0.13, …, 0.02)

      Public SMS intersect | IMSI 1 | IMSI 2 | …… | IMSI n | Monthly SMS count | Fingerprint (eigenvector)
      User A IMSI          | 50%    | 10%    | …… | 5%     | 200               | (0.5, 0.1, …, 0.05)
      User B IMSI          | 20%    | 13%    | …… | 2%     | 260               | (0.2, 0.13, …, 0.02)

      Public base station | CGI 1 | CGI 2 | …… | CGI n | Shutdown | Fingerprint (eigenvector)
      User A IMSI         | 20%   | 12%   | …… | 5%    | 20%      | (0.2, 0.12, …, 0.05, 0.2)
      User B IMSI         | 15%   | 13%   | …… | 2%    | 5%       | (0.15, 0.13, …, 0.02, 0.05)
  • 29. Case 3: Social analysis, activity fingerprint detection
      • Let $\theta$ be the angle between two fingerprint vectors
      • When $\theta$ equals $90^\circ$, the two vectors are independent
      • When $\theta$ equals $0$, the two vectors are perfectly dependent
      • The closer $\theta$ is to $0$, the more dependent the vectors are
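The angle comparison above can be sketched with the standard cosine formula, $\cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. This is only an illustration: the slides do not show how the course computes the angle, the function name `angle_degrees` is hypothetical, and the sample vectors keep only the three components that are visible in the truncated fingerprints.

```python
import math


def angle_degrees(u, v):
    """Angle between two equal-length vectors, in degrees."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


# Truncated voice fingerprints of users A and B (visible components only)
user_a = [0.2, 0.12, 0.05]
user_b = [0.15, 0.13, 0.02]

print(angle_degrees(user_a, user_b))   # small angle: similar calling behaviour
print(angle_degrees([1, 0], [0, 1]))   # orthogonal vectors
```

A small angle suggests that two IMSIs may belong to the same person; the cut-off below which they count as duplicates would have to be chosen from the data.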
  • 30. Case 3: Social analysis, VIP detection
  • 31. The Solution Analysts Look Forward To
      • Eliminates the bottleneck completely for the foreseeable future
      • Carries over existing techniques smoothly, for example SQL and R
      • Cost of the new platform: hardware and software, re-development, skills training, maintenance
  • 32. Path to Big Data
  • 33. Idea of Hadoop
  • 34. Map-Reduce Programming
  • 35. Map-Reduce program for meteorological data analysis
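The content of the Map-Reduce slides is not preserved in this transcript, so the following is only a sketch of the general pattern, run in a single process rather than on Hadoop. It assumes the meteorological job finds the maximum temperature per year, a common teaching example; the `year,temperature` record format, the function names, and the sample readings are all hypothetical.

```python
from collections import defaultdict


def map_record(line):
    """Map phase: turn one 'year,temperature' line into a (year, temperature) pair."""
    year, temp = line.strip().split(",")
    return year, int(temp)


def shuffle(pairs):
    """Shuffle phase: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_max(year, temps):
    """Reduce phase: keep the highest temperature seen for the year."""
    return year, max(temps)


def run_job(lines):
    """Chain map -> shuffle -> reduce over an iterable of input lines."""
    mapped = (map_record(line) for line in lines if line.strip())
    grouped = shuffle(mapped)
    return dict(reduce_max(year, temps) for year, temps in sorted(grouped.items()))


sample = ["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]
result = run_job(sample)
print(result)
```

On a real cluster the framework performs the shuffle and runs many mappers and reducers in parallel; only the map and reduce functions are written by the programmer.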
  • 36. Map-Reduce implementation for popular algorithms
  • 37. Map-Reduce implementation for popular algorithms
  • 38. Why not Hadoop?
      • Java?
      • Hard to control?
      • Hard to integrate data?
      • Hadoop vs Oracle
  • 39. Analysis under the Hadoop System
      • Mainstream: Java programs
      • Lightweight scripting language: Pig
      • Smooth migration from SQL: Hive
      • NoSQL: HBase
  • 40. The Hadoop Family
  • 41. Pig
      • Pig can be treated as client software for Hadoop: it connects to Hadoop and runs analyses
      • Pig is convenient for users unfamiliar with Java; it uses Pig Latin, a SQL-like language for working with data flows
      • Pig Latin can sort, filter, sum, group, and join data, and lets users define custom functions. It is a lightweight scripting language for data manipulation and analysis
      • Pig can be treated as a mapping from Pig Latin to Map-Reduce
  • 42. Hive
      • A data warehouse tool that can turn the primary data structures in Hadoop into Hive tables
      • Supports HiveQL, a language almost identical to SQL; it offers the same functionality as SQL except updating, indexing and …
      • Can be treated as a mapping from SQL to Map-Reduce
      • Offers interfaces for the shell, JDBC/ODBC, Thrift, and the web
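To make the idea of a mapping from SQL to Map-Reduce concrete, here is a minimal single-process sketch of how an aggregate query could be split into a map step and a reduce step. It is not how Hive plans queries internally, which is more involved; the `visits` table, its `city` column, and the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Query being imitated (hypothetical table and column):
#   SELECT city, COUNT(*) FROM visits GROUP BY city;

rows = [
    {"city": "NYC", "user": "a"},
    {"city": "Boston", "user": "b"},
    {"city": "NYC", "user": "c"},
]


def mapper(row):
    """Emit the GROUP BY key with a count of 1 for each input row."""
    yield row["city"], 1


def reducer(key, values):
    """Sum the counts for one key, which is what COUNT(*) needs."""
    return key, sum(values)


def group_by_count(table):
    # Sorting by key plays the role of the shuffle/sort between map and reduce
    pairs = sorted(pair for row in table for pair in mapper(row))
    return [
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


counts = group_by_count(rows)
print(counts)
```

The GROUP BY column becomes the map output key and the aggregate function becomes the reducer, which is why people who already know SQL can use Hive without writing Java.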
  • 43. Features of Mahout
      • Mahout provides scalable machine learning algorithms (Map-Reduce implementations), but the Hadoop platform is not required: the core library also has efficient single-machine algorithms
      • Mature and popular algorithms:
          1. Frequent itemset mining
          2. Clustering
          3. Classification
          4. Recommendation systems
          5. Frequent subgraph mining
  • 44. Reference Textbooks
  • 45. Reference Textbooks
  • 46. Reference Textbooks
  • 47. Reference Textbooks
  • 48. Typical Experiment Environment (with a server)
      • Server: ESXi, able to host multiple virtual machines and run 3 of them at the same time
      • PC: Linux or Windows + Cygwin; Linux can be standalone or a virtual machine
      • SSH: use the ssh command under Linux, or SecureCRT or PuTTY under Windows, to connect to the remote Linux server
      • VMware client: for managing ESXi
      • Hadoop: use version 1.x or 2.x
  • 49. Typical Experiment Environment (with only a PC or laptop running Windows)
      • At least 4 GB of memory; 64-bit Windows is preferred, because a 32-bit machine can use only slightly more than 3 GB of memory
      • Install VMware Workstation or VirtualBox
      • Deploy 3 virtual machines and run them at the same time. If only two VMs can run, treat the host as a node (via Cygwin) and use bridged networking for the virtual network
      • Install Linux and Java
      • On older computers, consider a pseudo-distributed environment
  • 50. Experiment Environment
      • Deploy Pig
      • Deploy Hive
      • Deploy Mahout
  • 51. List of Cases of the Course
      • Analysis of a high-volume website log system; retrieving KPI data (Map-Reduce)
      • LBS application for a telecommunications company; analysis of users' mobile phone traces (Map-Reduce)
      • User analysis for a telecommunications company; labeling duplicate users by call fingerprint (Map-Reduce)
      • Recommendation system for an e-commerce company (Map-Reduce)
      • Complex recommendation system application (Mahout)
      • Social network; distance between users; community detection (Pig)
      • Importance of nodes in a social network (Map-Reduce)
      • Application of clustering algorithms; VIP analysis (Map-Reduce, Mahout)
      • Financial data analysis; retrieving reverse repurchase information from historical data (Hive)
      • Setting stock strategies with data analysis (Map-Reduce, Hive)
      • GPS application; sign-in data analysis (Pig)
      • Implementation and optimization of sorting on Map-Reduce
      • Middleware development; cooperation of multiple Hadoop clusters