SlideShare a Scribd company logo
1 of 25
BigData - Introduction
  Walter Dal Mut – walter.dalmut@gmail.com
  @walterdalmut - @corleycloud - @upcloo
Whoami
• Walter Dal Mut                                                    •     Corley S.r.l.
   • Startupper                                                           • Load and Stress tests
       • Corley S.r.l.
            • http://www.corley.it/                                          • www.corley.it
       • UpCloo Ltd.                                                      • Scalable CMS
            • http://www.upcloo.com/
   • Electronic Engineer
                                                                             • www.xitecms.it
       • Polytechnic of Turin                                             • Consultancy
• Social                                                                     • PHP
   • @walterdalmut - @corleycloud - @upcloo                                  • AWS
• Websites                                                                   • Distributed Systems
   • walterdalmut.com – corley.it – upcloo.com                               • Scalable Systems

                                 Walter Dal Mut walter.dalmut@gmail.com @walterdalmut
                                                 @corleycloud @upcloo
Wait a moment… What is it?
• Big data is a collection of data sets
   • So large and complex
   • Difficult to process:
         • using on-hand database management tools
         • traditional data processing
• Big data challenges
   •   Capture
   •   Curation
   •   Storage
   •   Search
   •   Sharing
   •   Transfer
   •   Analysis
   •   Visualization
Common scenario
• Data arriving at fast rates
• Data are typically unstructured or partially structured
   • Log files
   • JSON/BSON structures
• Data are stored without any kind of aggregation
What we are looking for?
• Scalable Storage
   • In 2010 Facebook claimed that they had 21 PB of storage
       • 30PB in 2011
       • 100PB in 2012
       • 500PB in 2013
• Massive Parallel Processing
   • In 2012 we can elaborate exabytes (10^18 bytes) of data in a reasonable amount of
     time
   • The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000
     core Linux cluster and produces data that is used in every Yahoo! Web search query.
• Resonable Cost
   • The New York Times used 100 Amazon EC2 instances and a Hadoop application to
     process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the
     space of 24 hours at a computation cost of about $240 (not including bandwidth)
What is Apache Hadoop?
• Apache Hadoop is an open-source software framework that supports data-
  intensive distributed applications.
   • It was derived from Google’s MapReduce and Google File System (GFS) papers
   • Hadoop implements a computational paradigm named: MapReduce
• Hadoop architecture
   • Hadoop Kernel
   • MapReduce
   • Hadoop Distributed File System (HDFS)
• Other project created in top of Hadoop
   •   Hive, Pig
   •   Hbase
   •   ZooKeeper
   •   Etc.
What is MapReduce?
           SLICE 0

           SLICE 1     R0

           SLICE 2

           SLICE 3
 Input
           SLICE 4     R1     Output

           SLICE 5

           SLICE 6     R2

           SLICE 7

            MAP      REDUCE
Word Count with Map Reduce
The Hadoop Ecosystem
        Data Mining           Data Access           Client Access
         Mahout                 Sqoop                  Hive, Pig


                               Network


        Data Storage        Data Processing         Coordination
        HDFS, HBase          MapReduce               ZooKeeper


                       JVM (Java Virtual Machine)


              Operating System (RedHat, Ubuntu, etc)


                               Hardware
Getting Started with Apache Hadoop
• Hadoop is a framework
• We need to write down our software
  • A Map
  • A Reduce
  • Put all togheter

• WordCount MapReduce with Hadoop
  • http://wiki.apache.org/hadoop/WordCount
(MAP) WordCount MapReduce with
Hadoop
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();


    public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
(RED) WordCount MapReduce with
Hadoop
public static class Reduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values,
Context context)
      throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
WordCount MapReduce with Hadoop
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");


    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);


    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));


    job.waitForCompletion(true);
}
Scripting Languages on top of Hadoop
Apache HIVE – Data Warehouse
• Hive is a data warehouse system for Hadoop
   • Facilitates easy data summarization
   • Ad-hoc queries
   • Analysis of large datasets stored in Hadoop compatible file systems
• SQL-like language called HiveQL
• Custom plug-in MapReduce if needed
Apache Hive Features
• SELECTs and FILTERs
   • SELECT * FROM table_name;
   • SELECT * FROM table_name tn WHERE tn.weekday = 12;
• GROUP BY
   • SELECT * FROM table_name GROUP BY weekday;
• JOINs
   • SELECT a.* FROM a JOIN b ON (a.id = b.id)
• More (multitable inserts, streaming)

• Example:
  https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingS
  tarted-SimpleExampleUseCases
A real example with a CSV (tab-
separated)
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 't'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE
u_data;

SELECT COUNT(*) FROM u_data;
Apache PIG
• It is a platform for analyzing large data sets that consists of a high-
  level language for expressing data analysis programs

• Pig's infrastructure layer consists of a compiler that produces
  sequences of Map-Reduce programs
   • Ease of programming
   • Optimization opportunities
      • Auto-optimization of tasks, allowing the user to focus on semantics rather than
        efficiency.
   • Extensibility
Apache Pig – Log Flow analysis
• A common scenario is collect logs and analyse it as post processing
• We will cover Apache2 (web server) log analysis
   • Read AWS Elastic Map Reduce  Apache Pig example
      • http://aws.amazon.com/articles/2729
• Apache Log:
   • 122.161.184.193 - - [21/Jul/2009:13:14:17 -0700] "GET /rss.pl
     HTTP/1.1" 200 35942 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT
     6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.21022;
     InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618;
     OfficeLiveConnector.1.3; OfficeLivePatch.1.3; MSOffice 12)"
Log Analysis [Common Log Format
(CLF)]
• More information about CLF:
   • http://httpd.apache.org/docs/2.2/logs.html
• Log interesting fields:
   •   Remote Address (IP)
   •   Remote Log Name
   •   User
   •   Time
   •   Request
   •   Status
   •   Bytes
   •   Refererer
   •   Browser
Getting started with Apache Pig
• First of all we need to load all logs from a HDFS compabile source (for
  example AWS S3)
• Create a table starting from all log lines
• Create the Map Reduce program in order to analyse and store results
Load logs from S3 and create a table
RAW_LOGS = LOAD ‘s3://bucket/path’ USING TextLoader as (line:chararray);


LOGS_BASE = FOREACH RAW_LOGS GENERATE
   FLATTEN(
      EXTRACT(line, '^(S+) (S+) (S+) [([w:/]+s[+-]d{4})] "(.+?)" (S+) (S+)
"([^"]*)" "([^"]*)"')
   )
   as (
       remoteAddr:     chararray, remoteLogname: chararray,
       user:           chararray, time:         chararray,
       request:        chararray, status:       int,
       bytes_string:   chararray, referrer:     chararray,
       browser:        chararray
 );
LOGS_BASE is a Table
Remote      logNm   User   Time        Request   Status   Bytes   Refererer   Browser
72.14.194.1 -       -      21/Jul/20   GET        200     2969    -           FeedFetch
                           09:18:04:   /gwidgets                              er-
                           54 -0700    /alexa.xml                             Google;
                                       HTTP/1.1                               (+http://w
                                                                              ww.googl
                                                                              e.com/fee
                                                                              dfetcher.h
                                                                              tml)
Data Analysis
REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;

FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR
referrer matches '.*google.*';

SEARCH_TERMS = FOREACH FILTERED GENERATE
FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as
terms:chararray;

SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL;

SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0)
GENERATE $0, COUNT($1) as num;

SEARCH_TERMS_COUNT_SORTED = ORDER SEARCH_TERMS_COUNT BY num DESC;
Thanks for Listening
• Any questions?

More Related Content

What's hot

Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is FailingDataWorks Summit
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Guy Harrison
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaCloudera, Inc.
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaKnoldus Inc.
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleMapR Technologies
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorialawesomesos
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 

What's hot (20)

Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
HiveServer2
HiveServer2HiveServer2
HiveServer2
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Nov 2011 HUG: Blur - Lucene on Hadoop
Nov 2011 HUG: Blur - Lucene on HadoopNov 2011 HUG: Blur - Lucene on Hadoop
Nov 2011 HUG: Blur - Lucene on Hadoop
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scale
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Hadoop architecture by ajay
Hadoop architecture by ajayHadoop architecture by ajay
Hadoop architecture by ajay
 

Similar to Big data, just an introduction to Hadoop and Scripting Languages

Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWAREFernando Lopez Aguilar
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationFIWARE
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413gregchanan
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewMadhur Nawandar
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...rhatr
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Polyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the CloudPolyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the CloudAndrei Savu
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 

Similar to Big data, just an introduction to Hadoop and Scripting Languages (20)

SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARE
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE Implementation
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overview
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
מיכאל
מיכאלמיכאל
מיכאל
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Polyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the CloudPolyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the Cloud
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 

More from Corley S.r.l.

Aws rekognition - riconoscimento facciale
Aws rekognition  - riconoscimento faccialeAws rekognition  - riconoscimento facciale
Aws rekognition - riconoscimento faccialeCorley S.r.l.
 
AWSome day 2018 - scalability and cost optimization with container services
AWSome day 2018 - scalability and cost optimization with container servicesAWSome day 2018 - scalability and cost optimization with container services
AWSome day 2018 - scalability and cost optimization with container servicesCorley S.r.l.
 
AWSome day 2018 - API serverless with aws
AWSome day 2018  - API serverless with awsAWSome day 2018  - API serverless with aws
AWSome day 2018 - API serverless with awsCorley S.r.l.
 
AWSome day 2018 - database in cloud
AWSome day 2018 -  database in cloudAWSome day 2018 -  database in cloud
AWSome day 2018 - database in cloudCorley S.r.l.
 
Trace your micro-services oriented application with Zipkin and OpenTracing
Trace your micro-services oriented application with Zipkin and OpenTracing Trace your micro-services oriented application with Zipkin and OpenTracing
Trace your micro-services oriented application with Zipkin and OpenTracing Corley S.r.l.
 
Apiconf - The perfect REST solution
Apiconf - The perfect REST solutionApiconf - The perfect REST solution
Apiconf - The perfect REST solutionCorley S.r.l.
 
Apiconf - Doc Driven Development
Apiconf - Doc Driven DevelopmentApiconf - Doc Driven Development
Apiconf - Doc Driven DevelopmentCorley S.r.l.
 
Authentication and authorization in res tful infrastructures
Authentication and authorization in res tful infrastructuresAuthentication and authorization in res tful infrastructures
Authentication and authorization in res tful infrastructuresCorley S.r.l.
 
Flexibility and scalability of costs in serverless infrastructures
Flexibility and scalability of costs in serverless infrastructuresFlexibility and scalability of costs in serverless infrastructures
Flexibility and scalability of costs in serverless infrastructuresCorley S.r.l.
 
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented application
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented applicationCloudConf2017 - Deploy, Scale & Coordinate a microservice oriented application
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented applicationCorley S.r.l.
 
A single language for backend and frontend from AngularJS to cloud with Clau...
A single language for backend and frontend  from AngularJS to cloud with Clau...A single language for backend and frontend  from AngularJS to cloud with Clau...
A single language for backend and frontend from AngularJS to cloud with Clau...Corley S.r.l.
 
AngularJS: Service, factory & provider
AngularJS: Service, factory & providerAngularJS: Service, factory & provider
AngularJS: Service, factory & providerCorley S.r.l.
 
The advantage of developing with TypeScript
The advantage of developing with TypeScript The advantage of developing with TypeScript
The advantage of developing with TypeScript Corley S.r.l.
 
Angular coding: from project management to web and mobile deploy
Angular coding: from project management to web and mobile deployAngular coding: from project management to web and mobile deploy
Angular coding: from project management to web and mobile deployCorley S.r.l.
 
Corley cloud angular in cloud
Corley cloud   angular in cloudCorley cloud   angular in cloud
Corley cloud angular in cloudCorley S.r.l.
 
Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Corley S.r.l.
 
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS Lambda
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS LambdaRead Twitter Stream and Tweet back pictures with Raspberry Pi & AWS Lambda
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS LambdaCorley S.r.l.
 
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...Corley S.r.l.
 
Middleware PHP - A simple micro-framework
Middleware PHP - A simple micro-frameworkMiddleware PHP - A simple micro-framework
Middleware PHP - A simple micro-frameworkCorley S.r.l.
 

More from Corley S.r.l. (20)

Aws rekognition - riconoscimento facciale
Aws rekognition  - riconoscimento faccialeAws rekognition  - riconoscimento facciale
Aws rekognition - riconoscimento facciale
 
AWSome day 2018 - scalability and cost optimization with container services
AWSome day 2018 - scalability and cost optimization with container servicesAWSome day 2018 - scalability and cost optimization with container services
AWSome day 2018 - scalability and cost optimization with container services
 
AWSome day 2018 - API serverless with aws
AWSome day 2018  - API serverless with awsAWSome day 2018  - API serverless with aws
AWSome day 2018 - API serverless with aws
 
AWSome day 2018 - database in cloud
AWSome day 2018 -  database in cloudAWSome day 2018 -  database in cloud
AWSome day 2018 - database in cloud
 
Trace your micro-services oriented application with Zipkin and OpenTracing
Trace your micro-services oriented application with Zipkin and OpenTracing Trace your micro-services oriented application with Zipkin and OpenTracing
Trace your micro-services oriented application with Zipkin and OpenTracing
 
Apiconf - The perfect REST solution
Apiconf - The perfect REST solutionApiconf - The perfect REST solution
Apiconf - The perfect REST solution
 
Apiconf - Doc Driven Development
Apiconf - Doc Driven DevelopmentApiconf - Doc Driven Development
Apiconf - Doc Driven Development
 
Authentication and authorization in res tful infrastructures
Authentication and authorization in res tful infrastructuresAuthentication and authorization in res tful infrastructures
Authentication and authorization in res tful infrastructures
 
Flexibility and scalability of costs in serverless infrastructures
Flexibility and scalability of costs in serverless infrastructuresFlexibility and scalability of costs in serverless infrastructures
Flexibility and scalability of costs in serverless infrastructures
 
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented application
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented applicationCloudConf2017 - Deploy, Scale & Coordinate a microservice oriented application
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented application
 
React vs Angular2
React vs Angular2React vs Angular2
React vs Angular2
 
A single language for backend and frontend from AngularJS to cloud with Clau...
A single language for backend and frontend  from AngularJS to cloud with Clau...A single language for backend and frontend  from AngularJS to cloud with Clau...
A single language for backend and frontend from AngularJS to cloud with Clau...
 
AngularJS: Service, factory & provider
AngularJS: Service, factory & providerAngularJS: Service, factory & provider
AngularJS: Service, factory & provider
 
The advantage of developing with TypeScript
The advantage of developing with TypeScript The advantage of developing with TypeScript
The advantage of developing with TypeScript
 
Angular coding: from project management to web and mobile deploy
Angular coding: from project management to web and mobile deployAngular coding: from project management to web and mobile deploy
Angular coding: from project management to web and mobile deploy
 
Corley cloud angular in cloud
Corley cloud   angular in cloudCorley cloud   angular in cloud
Corley cloud angular in cloud
 
Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2
 
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS Lambda
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS LambdaRead Twitter Stream and Tweet back pictures with Raspberry Pi & AWS Lambda
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS Lambda
 
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...
 
Middleware PHP - A simple micro-framework
Middleware PHP - A simple micro-frameworkMiddleware PHP - A simple micro-framework
Middleware PHP - A simple micro-framework
 

Recently uploaded

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 

Recently uploaded (20)

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 

Big data, just an introduction to Hadoop and Scripting Languages

  • 1. BigData - Introduction Walter Dal Mut – walter.dalmut@gmail.com @walterdalmut - @corleycloud - @upcloo
  • 2. Whoami • Walter Dal Mut • Corley S.r.l. • Startupper • Load and Stress tests • Corley S.r.l. • http://www.corley.it/ • www.corley.it • UpCloo Ltd. • Scalable CMS • http://www.upcloo.com/ • Electronic Engineer • www.xitecms.it • Polytechnic of Turin • Consultancy • Social • PHP • @walterdalmut - @corleycloud - @upcloo • AWS • Websites • Distributed Systems • walterdalmut.com – corley.it – upcloo.com • Scalable Systems Walter Dal Mut walter.dalmut@gmail.com @walterdalmut @corleycloud @upcloo
  • 3. Wait a moment… What is it? • Big data is a collection of data sets • So large and complex • Difficult to process: • using on-hand database management tools • traditional data processing • Big data challenges • Capture • Curation • Storage • Search • Sharing • Transfer • Analysis • Visualization
  • 4. Common scenario • Data arriving at fast rates • Data are typically unstructured or partially structured • Log files • JSON/BSON structures • Data are stored without any kind of aggregation
  • 5. What we are looking for? • Scalable Storage • In 2010 Facebook claimed that they had 21 PB of storage • 30PB in 2011 • 100PB in 2012 • 500PB in 2013 • Massive Parallel Processing • In 2012 we can elaborate exabytes (10^18 bytes) of data in a reasonable amount of time • The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster and produces data that is used in every Yahoo! Web search query. • Resonable Cost • The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth)
  • 6. What is Apache Hadoop? • Apache Hadoop is an open-source software framework that supports data- intensive distributed applications. • It was derived from Google’s MapReduce and Google File System (GFS) papers • Hadoop implements a computational paradigm named: MapReduce • Hadoop architecture • Hadoop Kernel • MapReduce • Hadoop Distributed File System (HDFS) • Other project created in top of Hadoop • Hive, Pig • Hbase • ZooKeeper • Etc.
  • 7. What is MapReduce? SLICE 0 SLICE 1 R0 SLICE 2 SLICE 3 Input SLICE 4 R1 Output SLICE 5 SLICE 6 R2 SLICE 7 MAP REDUCE
  • 8. Word Count with Map Reduce
  • 9. The Hadoop Ecosystem Data Mining Data Access Client Access Mahout Sqoop Hive, Pig Network Data Storage Data Processing Coordination HDFS, HBase MapReduce ZooKeeper JVM (Java Virtual Machine) Operating System (RedHat, Ubuntu, etc) Hardware
  • 10. Getting Started with Apache Hadoop • Hadoop is a framework • We need to write down our software • A Map • A Reduce • Put all togheter • WordCount MapReduce with Hadoop • http://wiki.apache.org/hadoop/WordCount
  • 11. (MAP) WordCount MapReduce with Hadoop public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } }
  • 12. (RED) WordCount MapReduce with Hadoop public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } }
  • 13. WordCount MapReduce with Hadoop public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); }
  • 14. Scripting Languages on top of Hadoop
  • 15. Apache HIVE – Data Warehouse • Hive is a data warehouse system for Hadoop • Facilitates easy data summarization • Ad-hoc queries • Analysis of large datasets stored in Hadoop compatible file systems • SQL-like language called HiveQL • Custom plug-in MapReduce if needed
  • 16. Apache Hive Features • SELECTs and FILTERs • SELECT * FROM table_name; • SELECT * FROM table_name tn WHERE tn.weekday = 12; • GROUP BY • SELECT * FROM table_name GROUP BY weekday; • JOINs • SELECT a.* FROM a JOIN b ON (a.id = b.id) • More (multitable inserts, streaming) • Example: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingS tarted-SimpleExampleUseCases
  • 17. A real example with a CSV (tab- separated) CREATE TABLE u_data ( userid INT, movieid INT, rating INT, unixtime STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE; LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE u_data; SELECT COUNT(*) FROM u_data;
  • 18. Apache PIG • It is a platform for analyzing large data sets that consists of a high- level language for expressing data analysis programs • Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs • Ease of programming • Optimization opportunities • Auto-optimization of tasks, allowing the user to focus on semantics rather than efficiency. • Extensibility
  • 19. Apache Pig – Log Flow analysis • A common scenario is collect logs and analyse it as post processing • We will cover Apache2 (web server) log analysis • Read AWS Elastic Map Reduce  Apache Pig example • http://aws.amazon.com/articles/2729 • Apache Log: • 122.161.184.193 - - [21/Jul/2009:13:14:17 -0700] "GET /rss.pl HTTP/1.1" 200 35942 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.21022; InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618; OfficeLiveConnector.1.3; OfficeLivePatch.1.3; MSOffice 12)"
  • 20. Log Analysis [Common Log Format (CLF)] • More information about CLF: • http://httpd.apache.org/docs/2.2/logs.html • Log interesting fields: • Remote Address (IP) • Remote Log Name • User • Time • Request • Status • Bytes • Refererer • Browser
  • 21. Getting started with Apache Pig • First of all we need to load all logs from a HDFS compabile source (for example AWS S3) • Create a table starting from all log lines • Create the Map Reduce program in order to analyse and store results
  • 22. Load logs from S3 and create a table RAW_LOGS = LOAD ‘s3://bucket/path’ USING TextLoader as (line:chararray); LOGS_BASE = FOREACH RAW_LOGS GENERATE FLATTEN( EXTRACT(line, '^(S+) (S+) (S+) [([w:/]+s[+-]d{4})] "(.+?)" (S+) (S+) "([^"]*)" "([^"]*)"') ) as ( remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray, request: chararray, status: int, bytes_string: chararray, referrer: chararray, browser: chararray );
  • 23. LOGS_BASE is a Table Remote logNm User Time Request Status Bytes Refererer Browser 72.14.194.1 - - 21/Jul/20 GET 200 2969 - FeedFetch 09:18:04: /gwidgets er- 54 -0700 /alexa.xml Google; HTTP/1.1 (+http://w ww.googl e.com/fee dfetcher.h tml)
  • 24. Data Analysis REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer; FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR referrer matches '.*google.*'; SEARCH_TERMS = FOREACH FILTERED GENERATE FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as terms:chararray; SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL; SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0) GENERATE $0, COUNT($1) as num; SEARCH_TERMS_COUNT_SORTED = ORDER SEARCH_TERMS_COUNT BY num DESC;
  • 25. Thanks for Listening • Any questions?

Editor's Notes

  1. In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis.