Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseSages
Szybkie wprowadzenie do technologii Pig i Hive z ekosystemu Hadoop. Prezentacja wykonana w ramach warsztatów Codepot w dniu 29.08.2015. Prezentacja wykonana przez Radosława Stankiewicza oraz Bartłomieja Tartanusa.
Wprowadzenie do technologii Big Data / Intro to Big Data EcosystemSages
Introduction to Hadoop Map Reduce, Pig, Hive and Ambari technologies.
Workshop deck prepared and presented on September 5th 2015 by Radosław Stankiewicz.
During that the day participants had also the possibility to go through prepared tutorials and test their analysis on real cluster.
Slides for the Cluj.py meetup where we explored the inner workings of CPython, the reference implementation of Python. Includes examples of writing a C extension to Python, and introduces Cython - ultimately the sanest way of writing C extensions.
Also check out the code samples on GitHub: https://github.com/trustyou/meetups/tree/master/python-c
Cluj Big Data Meetup - Big Data in PracticeSteffen Wenz
At the Cluj Big Data Meetup, we shared some insights into TrustYou's big data tech stack. Also we introduced two tools which we've found useful in our production jobs: Apache Pig and Luigi.
Also check out the code samples on GitHub: https://github.com/trustyou/meetups/tree/master/big-data
Using Cerberus and PySpark to validate semi-structured datasetsBartosz Konieczny
This short presentation shows one of ways to to integrate Cerberus and PySpark. It was initially given at Paris.py meetup (https://www.meetup.com/Paris-py-Python-Django-friends/events/264404036/)
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseSages
Szybkie wprowadzenie do technologii Pig i Hive z ekosystemu Hadoop. Prezentacja wykonana w ramach warsztatów Codepot w dniu 29.08.2015. Prezentacja wykonana przez Radosława Stankiewicza oraz Bartłomieja Tartanusa.
Wprowadzenie do technologii Big Data / Intro to Big Data EcosystemSages
Introduction to Hadoop Map Reduce, Pig, Hive and Ambari technologies.
Workshop deck prepared and presented on September 5th 2015 by Radosław Stankiewicz.
During that the day participants had also the possibility to go through prepared tutorials and test their analysis on real cluster.
Slides for the Cluj.py meetup where we explored the inner workings of CPython, the reference implementation of Python. Includes examples of writing a C extension to Python, and introduces Cython - ultimately the sanest way of writing C extensions.
Also check out the code samples on GitHub: https://github.com/trustyou/meetups/tree/master/python-c
Cluj Big Data Meetup - Big Data in PracticeSteffen Wenz
At the Cluj Big Data Meetup, we shared some insights into TrustYou's big data tech stack. Also we introduced two tools which we've found useful in our production jobs: Apache Pig and Luigi.
Also check out the code samples on GitHub: https://github.com/trustyou/meetups/tree/master/big-data
Using Cerberus and PySpark to validate semi-structured datasetsBartosz Konieczny
This short presentation shows one of ways to to integrate Cerberus and PySpark. It was initially given at Paris.py meetup (https://www.meetup.com/Paris-py-Python-Django-friends/events/264404036/)
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak PROIDEA
Speaker: Andrzej Dyjak
Language: English
In recent years security industry started to grow fond of Apple’s iOS and OS X platforms. This talk will cover one of XNU's flagship debugging utilities: DTrace, a dynamic tracing framework for troubleshooting kernel and application problems on production systems in real time. It will be shown how it can be used in order to ease various tasks within the realm of dynamic binary analysis and beyond.
CONFidence: http://confidence.org.pl/
The slides I prepared for https://www.meetup.com/Paris-Apache-Kafka-Meetup/events/268164461/ about Apache Kafka integration in Apache Spark Structured Streaming.
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiInfluxData
When a large group of people change their habits, it can be tricky for infrastructures! Working from home and spending time indoor today means attending video calls and streaming movies and tv shows. This leads to increased internet traffic that can create congestion on the network infrastructure. So how do you get real-time visibility into your ISP connection? In this meetup, Mirko presents his setup based on a time series database and Raspberry Pi to better understand his ISP connection quality and speed — including upload and download speeds. Join us to discover how he does it using Telegraf, InfluxDB Cloud, Astro Pi, Telegram and Grafana! Finally, proof that your ISP connection is (or is not) as fast as it promises.
Many projects use at least some of them - the Jakarta Commons libraries. Small reusable libraries simplifying the day-to-day work of thousands of java programmers. But over time the jakarta commons project has grown and the number of components makes it harder and harder to keep track. This session will try to give an overview of the components available and how the Jakarta Commons community is organized.
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak PROIDEA
Speaker: Andrzej Dyjak
Language: English
In recent years security industry started to grow fond of Apple’s iOS and OS X platforms. This talk will cover one of XNU's flagship debugging utilities: DTrace, a dynamic tracing framework for troubleshooting kernel and application problems on production systems in real time. It will be shown how it can be used in order to ease various tasks within the realm of dynamic binary analysis and beyond.
CONFidence: http://confidence.org.pl/
The slides I prepared for https://www.meetup.com/Paris-Apache-Kafka-Meetup/events/268164461/ about Apache Kafka integration in Apache Spark Structured Streaming.
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiInfluxData
When a large group of people change their habits, it can be tricky for infrastructures! Working from home and spending time indoor today means attending video calls and streaming movies and tv shows. This leads to increased internet traffic that can create congestion on the network infrastructure. So how do you get real-time visibility into your ISP connection? In this meetup, Mirko presents his setup based on a time series database and Raspberry Pi to better understand his ISP connection quality and speed — including upload and download speeds. Join us to discover how he does it using Telegraf, InfluxDB Cloud, Astro Pi, Telegram and Grafana! Finally, proof that your ISP connection is (or is not) as fast as it promises.
Many projects use at least some of them - the Jakarta Commons libraries. Small reusable libraries simplifying the day-to-day work of thousands of java programmers. But over time the jakarta commons project has grown and the number of components makes it harder and harder to keep track. This session will try to give an overview of the components available and how the Jakarta Commons community is organized.
Introduction to Hadoop Map Reduce, Pig, Hive and HBase technologies.
Workshop deck prepared and presented on May 30th 2015 by Radosław Stankiewicz.
During that day participants had also possibility to go through prepared tutorials and test their analysis on real cluster.
Владимир Тагаков. Dagger2: dependency injection in AndroidMail.ru Group
Владимир Тагаков, независимый разработчик, рассказал про Dagger 2 — библиотеку от Google.
Он рассмотрел возможности современных решений Dependency Injection в Android-среде и продемонстрировал преимущества и недостатки подходов с использованием Depencency Injection.
Zapowiedź raportu - Stan Androida w Polsce 2015Piotr Biegun
Zapowiedź raportu opisującego stan Androida w Polsce.
„Dlaczego powstał ten raport?”
Mamy dużo dobrych raportów od IAB, Generation Mobile, Jestm Mobi, Monika Mikowska. Posiadają genialne dane, dobrze opracowane i idealne od podejmowania decyzji przez dyrektorów, strategów.
W bieżącej pracy z aplikacjami mobilnymi te dane niestety nie dają nam pewności, że decyjze projektowe, które podejmujemy mają jakikolwiek sens.
Chcemy to zmienić dlatego publikujemy ten raport. Jeśli widzisz jakieś błędy. Masz dane, które obalają to co wyszło z badań - napisz do nas. Obiecujemy, że wszystkie wnioski i opinie, które mają sens, są oparte o dane chętnie dołączymy do naszego raportu i strony.
Stworzone przez Whalla Labs i SW Research - opublikowane w trakcie Infoshare 2015.
This is an quick introduction to Scalding and Monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids on the other hand promise parallelism and quality and they make some more challenging algorithms look very easy.
The talk was held at the Helsinki Data Science meetup on January 9th 2014.
Scalable and Flexible Machine Learning With Scala @ LinkedInVitaly Gordon
The presentation given by Chris Severs and myself at the Bay Area Scala Enthusiasts meetup. http://www.meetup.com/Bay-Area-Scala-Enthusiasts/events/105409962/
GDG Devfest 2019 - Build go kit microservices at kubernetes with easeKAI CHU CHUNG
Gokit is microservice tookit and use Service/Endpoint/Transport to strict separation of concerns design. This talk to use go-kit develop microservice application integrate with consul, zipkin, prometheus, etc service and deploy on Kubernetes.
Founding committer of Spark, Patrick Wendell, gave this talk at 2015 Strata London about Apache Spark.
These slides provides an introduction to Spark, and delves into future developments, including DataFrames, Datasource API, Catalyst logical optimizer, and Project Tungsten.
EuroPython 2015 - Big Data with Python and HadoopMax Tepkeev
Big Data - these two words are heard so often nowadays. But what exactly is Big Data ? Can we, Pythonistas, enter the wonder world of Big Data ? The answer is definitely “Yes”.
This talk is an introduction to the big data processing using Apache Hadoop and Python. We’ll talk about Apache Hadoop, it’s concepts, infrastructure and how one can use Python with it. We’ll compare the speed of Python jobs under different Python implementations, including CPython, PyPy and Jython and also discuss what Python libraries are available out there to work with Apache Hadoop.
ASTs are an incredibly powerful tool for understanding and manipulating JavaScript. We'll explore this topic by looking at examples from ESLint, a pluggable static analysis tool, and Browserify, a client-side module bundler. Through these examples we'll see how ASTs can be great for analyzing and even for modifying your JavaScript. This talk should be interesting to anyone that regularly builds apps in JavaScript either on the client-side or on the server-side.
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
Learn Hadoop and Bigdata Analytics, Join Design Pathshala training programs on Big data and analytics.
This slide covers the Advance Map reduce concepts of Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
PigSPARQL: A SPARQL Query Processing Baseline for Big DataAlexander Schätzle
In this paper we discuss PigSPARQL, a competitive yet easy to use SPARQL query processing system on MapReduce that allows ad-hoc SPARQL query processing on large RDF graphs out of the box. Instead of a direct mapping, PigSPARQL uses the query language of Pig, a data analysis platform on top of Hadoop MapReduce, as an intermediate layer between SPARQL and MapReduce. This additional level of abstraction makes our approach independent of the actual Hadoop version and thus ensures the compatibility to future changes of the Hadoop framework as they will be covered by the underlying Pig layer. We revisit PigSPARQL and demonstrate the performance improvement when simply switching the underlying version of Pig from 0.5.0 to 0.11.0 without any changes to PigSPARQL itself. Because of this sustainability, PigSPARQL is an attractive long-term baseline for comparing various MapReduce based SPARQL implementations which is also underpinned by its competitiveness with existing systems, e.g. HadoopRDF.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
11. Klasyfikacja problemu
• Baza danych ulic Warszawy, Dane w formacie JSON,
optymalizacja odbioru śmieci jednego z usługodawców.
• Zdarzenia z bazy transakcyjnej i kart kredytowych w
celu lepszego wykrywania fraudów
• System wyszukujący dobre oferty samochodów z wielu
serwisów - web crawling, parsowanie danych, analiza
trendów cen samochodów
• Centralne repozytorium skanów umów, TB danych,
codziennie przybywa kilkaset nowych dokumentów
11
12. Geneza
• za dużo danych
• pady serwerów
• wolne relacyjne bazy danych
12
19. ● User Commands
o dfs
o fsck
● Administration Commands
o datanode
o dfsadmin
o namenode
dfs:
appendToFile cat chgrp chmod chown copyFromLocal copyToLocal count cp du
dus expunge get getfacl getfattr getmerge ls lsr mkdir moveFromLocal
moveToLocal mv put rm rmr setfacl setfattr setrep stat tail test text touchz
hdfs dfs -put localfile1 localfile2 /user/tmp/hadoopdir
hdfs dfs -getmerge /user/hadoop/output/ localfile
komendy
19
20. HDFS - uprawnienia
• prawie POSIX
• Users, Groups
• chmod, chgrp, chown
• ACL
• getfacl, setfacl
• można wyłączyć kontrolę uprawnień
• dodatkowo:
• Apache Knox
• Apache Ranger
24. Mapper
#!/usr/bin/env python
import sys
for line in sys.stdin:
words = line.strip().split()
for word in words:
print '%st%s' % (word, 1)
line = “Ala ma kota”
Ala 1
ma 1
kota 1
24
25. Reducer
#!/usr/bin/env python
import sys
current_word = None
current_count = 0
word = None
for line in sys.stdin:
line = line.strip()
word, count = line.split('t', 1)
count = int(count)
if current_word == word:
current_count += count
else:
if current_word:
print '%s,%s' % (current_word, current_count)
current_count = count
current_word = word
if current_word == word:
print '%s,%s' % (current_word, current_count)
ala 1
ala 1
bela 1
dela 1
ala,2
bela,1
dela,1
25
27. Map Reduce w Java
(input) <k1, v1> -> map -> <k2, v2> -> combine ->
<k2, v2> -> reduce -> <k3, v3> (output)
1) Mapper
2) Reducer
3) run
public class WordCount extends Configured
implements Tool {
public static class TokenizerMapper{...}
public static class IntSumReducer{...}
public int run(...){...}
}
27
28. Mapper<KEYIN,VALUEIN,KEY
OUT,VALUEOUT>
public static class TokenizerMapper
extends Mapper<LongWritable, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
public void setup(...) {...}
public void cleanup(...) {...}
public void run(...) {...}
}
value = “Ala ma kota”
Ala,1
ma,1
kota,1
29. Reducer<KEYIN,VALUEIN,KEY
OUT,VALUEOUT>
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
public void setup(...) {...}
public void cleanup(...) {...}
public void run(...) {...}
}
kota,(1,1,1,1)
kota,4
30. Main
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new WordCount(),args);
System.exit(res);
}
yarn jar wc.jar WordCount /tmp/wordcount/input /tmp/wordcount/output
31. Co dalej?
• Map Reduce w Javie
• Testowanie MRUnit
• Joins
• Avro
• Custom Key, Value
• Złączanie wielu zadań
• Custom Input, Output
31
35. Czy warto?
Top 5 stron odwiedzanych przez
użytkowników mających 18 lat
36. import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;
public class MRExample {
public static class LoadPages extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String key = line.substring(0, firstComma);
String value = line.substring(firstComma + 1);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("1" + value);
oc.collect(outKey, outVal);
}
}
public static class LoadAndFilterUsers extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String value = line.substring(firstComma + 1);
int age = Integer.parseInt(value);
if (age < 18 || age > 25) return;
String key = line.substring(0, firstComma);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("2" + value);
oc.collect(outKey, outVal);
}
}
public static class Join extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key,
Iterator<Text> iter,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// For each value, figure out which file it's from and
store it
// accordingly.
List<String> first = new ArrayList<String>();
List<String> second = new ArrayList<String>();
while (iter.hasNext()) {
Text t = iter.next();
String value = t.toString();
if (value.charAt(0) == '1')
first.add(value.substring(1));
else second.add(value.substring(1));
reporter.setStatus("OK");
}
// Do the cross product and collect the values
for (String s1 : first) {
for (String s2 : second) {
String outval = key + "," + s1 + "," + s2;
oc.collect(null, new Text(outval));
reporter.setStatus("OK");
}
}
}
37. public static class LoadJoined extends MapReduceBase
implements Mapper<Text, Text, Text, LongWritable> {
public void map(
Text k,
Text val,
OutputCollector<Text, LongWritable> oc,
Reporter reporter) throws IOException {
// Find the url
String line = val.toString();
int firstComma = line.indexOf(',');
int secondComma = line.indexOf(',', firstComma);
String key = line.substring(firstComma, secondComma);
// drop the rest of the record, I don't need it anymore,
// just pass a 1 for the combiner/reducer to sum instead.
Text outKey = new Text(key);
oc.collect(outKey, new LongWritable(1L));
}
}
public static class ReduceUrls extends MapReduceBase
implements Reducer<Text, LongWritable, WritableComparable,
Writable> {
public void reduce(
Text key,
Iterator<LongWritable> iter,
OutputCollector<WritableComparable, Writable> oc,
Reporter reporter) throws IOException {
// Add up all the values we see
long sum = 0;
while (iter.hasNext()) {
sum += iter.next().get();
reporter.setStatus("OK");
}
oc.collect(key, new LongWritable(sum));
}
}
public static class LoadClicks extends MapReduceBase
implements Mapper<WritableComparable, Writable, LongWritable,
Text> {
public void map(
WritableComparable key,
Writable val,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
oc.collect((LongWritable)val, (Text)key);
}
}
public static class LimitClicks extends MapReduceBase
implements Reducer<LongWritable, Text, LongWritable, Text> {
int count = 0;
public void reduce(
LongWritable key,
Iterator<Text> iter,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
// Only output the first 100 records
while (count < 100 && iter.hasNext()) {
oc.collect(key, iter.next());
count++;
}
}
}
38. public static void main(String[] args) throws IOException {
JobConf lp = new JobConf(MRExample.class);
lp.setJobName("Load Pages");
lp.setInputFormat(TextInputFormat.class);
lp.setOutputKeyClass(Text.class);
lp.setOutputValueClass(Text.class);
lp.setMapperClass(LoadPages.class);
FileInputFormat.addInputPath(lp, new
Path("/user/gates/pages"));
FileOutputFormat.setOutputPath(lp,
new Path("/user/gates/tmp/indexed_pages"));
lp.setNumReduceTasks(0);
Job loadPages = new Job(lp);
JobConf lfu = new JobConf(MRExample.class);
lfu.setJobName("Load and Filter Users");
lfu.setInputFormat(TextInputFormat.class);
lfu.setOutputKeyClass(Text.class);
lfu.setOutputValueClass(Text.class);
lfu.setMapperClass(LoadAndFilterUsers.class);
FileInputFormat.addInputPath(lfu, new
Path("/user/gates/users"));
FileOutputFormat.setOutputPath(lfu,
new Path("/user/gates/tmp/filtered_users"));
lfu.setNumReduceTasks(0);
Job loadUsers = new Job(lfu);
JobConf join = new JobConf(MRExample.class);
join.setJobName("Join Users and Pages");
join.setInputFormat(KeyValueTextInputFormat.class);
join.setOutputKeyClass(Text.class);
join.setOutputValueClass(Text.class);
join.setMapperClass(IdentityMapper.class);
join.setReducerClass(Join.class);
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/indexed_pages"));
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/filtered_users"));
FileOutputFormat.setOutputPath(join, new
Path("/user/gates/tmp/joined"));
join.setNumReduceTasks(50);
Job joinJob = new Job(join);
joinJob.addDependingJob(loadPages);
joinJob.addDependingJob(loadUsers);
JobConf group = new JobConf(MRExample.class);
group.setJobName("Group URLs");
group.setInputFormat(KeyValueTextInputFormat.class);
group.setOutputKeyClass(Text.class);
group.setOutputValueClass(LongWritable.class);
group.setOutputFormat(SequenceFileOutputFormat.class);
group.setMapperClass(LoadJoined.class);
group.setCombinerClass(ReduceUrls.class);
group.setReducerClass(ReduceUrls.class);
FileInputFormat.addInputPath(group, new
Path("/user/gates/tmp/joined"));
FileOutputFormat.setOutputPath(group, new
Path("/user/gates/tmp/grouped"));
group.setNumReduceTasks(50);
Job groupJob = new Job(group);
groupJob.addDependingJob(joinJob);
JobConf top100 = new JobConf(MRExample.class);
top100.setJobName("Top 100 sites");
top100.setInputFormat(SequenceFileInputFormat.class);
top100.setOutputKeyClass(LongWritable.class);
top100.setOutputValueClass(Text.class);
top100.setOutputFormat(SequenceFileOutputFormat.class);
top100.setMapperClass(LoadClicks.class);
top100.setCombinerClass(LimitClicks.class);
top100.setReducerClass(LimitClicks.class);
FileInputFormat.addInputPath(top100, new
Path("/user/gates/tmp/grouped"));
FileOutputFormat.setOutputPath(top100, new
Path("/user/gates/top100sitesforusers18to25"));
top100.setNumReduceTasks(1);
Job limit = new Job(top100);
limit.addDependingJob(groupJob);
JobControl jc = new JobControl("Find top 100 sites for users
18 to 25");
jc.addJob(loadPages);
jc.addJob(loadUsers);
jc.addJob(joinJob);
jc.addJob(groupJob);
jc.addJob(limit);
jc.run();
}
}
39. Users = load ‘users’ as (name, age);
Fltrd = filter Users by
age >= 18 and age <= 25;
Pages = load ‘pages’ as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into ‘top5sites’;
46. Podstawy Pig Latin -
wielkość liter
• A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
B = GROUP A BY f1;
C = FOREACH B GENERATE COUNT ($0);
DUMP C;
• Nazwy zmiennych A, B, and C (tzw. aliasy) są case sensitive.
• Wielkość liter jest też istotna dla:
• nazwy pól f1, f2, i f3
• nazwy zmiennych A, B, C
• nazwy funkcji PigStorage, COUNT
• Z wyjątkiem: LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE, oraz DUMP
46
48. Pierwsze kroki
data = LOAD 'input' AS (query:CHARARRAY);
A = LOAD 'data' USING PigStorage('t') AS (f1:int, f2:int, f3:int);
STORE A INTO '/tmp/result' USING PigStorage(';')
48
50. Kolejne kroki - operacje na
danych
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
B = FILTER A BY age > 20;
50
51. Kolejne kroki - operacje na
danych
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
B = FILTER A BY age > 20;
C = LIMIT B 5;
51
52. Kolejne kroki - operacje na
danych
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
B = FILTER A BY age > 20;
C = LIMIT B 5;
D = FOREACH C GENERATE name, scholarship*semestre as funds
52
53. Kolejne kroki - operacje na
danych
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
E = GROUP A by age
53
54. Kolejne kroki - operacje na
danych
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, semestre:int,
scholarship:float);
E = GROUP A by age
F = FOREACH E GENERATE group as age, AVG(A.scholarship)
54
61. Unikalne cechy Hive
Znaczne przyspieszenie analizy - nie potrzeba pisać Map Reduce
Optymalizacja, wykonywanie części operacji w pamięci zamiast MR
61
64. Hive CLI
Tryb Interaktywny
hive
Tryb Wsadowy:
hive -e ‘select foo from bar’
hive -f ‘/path/to/my/script.q’
hive -f ‘hdfs://namenode:port/path/to/my/
script.q’
więcej opcji: hive --help
64
65. Typy danych
INT, TINYINT, SMALLINT, BIGINT
BOOLEAN
DECIMAL
FLOAT, DOUBLE
STRING
BINARY
TIMESTAMP
ARRAY, MAP, STRUCT, UNION
DATE
CHAR
VARCHAR
65
66. Składnia zapytań
SELECT, INSERT, UPDATE
GROUP BY
UNION
LEFT, RIGHT, FULL INNER, FULL OUTER JOIN
OVER, RANK
(NOT) IN, HAVING
(NOT) EXISTS
66
67. Data Definition Language
• CREATE DATABASE/SCHEMA, TABLE, VIEW, FUNCTION, INDEX
• DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
• TRUNCATE TABLE
• ALTER DATABASE/SCHEMA, TABLE, VIEW
• MSCK REPAIR TABLE (or ALTER TABLE RECOVER PARTITIONS)
• SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES,
PARTITIONS, FUNCTIONS, INDEX[ES], COLUMNS, CREATE TABLE
• DESCRIBE DATABASE/SCHEMA, table_name, view_name
67
68. Tabele
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '001'
STORED AS TEXTFILE;
68
69. Pierwsze kroki w Hive
CREATE TABLE tablename1 (foo INT, bar STRING) PARTITIONED BY (ds STRING);
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename1;
INSERT INTO TABLE tablename1 PARTITION (ds='2014') select_statement1 FROM
from_statement;
69
70. Inne formaty plików? SerDe
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/
start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
CREATE TABLE apachelog (
host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING,
size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^]*) ([^]*) ([^]*) (-|[^]*]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?:
([^ "]*|".*") ([^ "]*|".*"))?"
)
STORED AS TEXTFILE;
70
71. Inne formaty plików? SerDe
CREATE TABLE table (
foo STRING, bar STRING)
STORED AS TEXTFILE; ← lub SEQUENCEFILE, ORC, AVRO lub PARQUET
71
72. Zalety, wady, porównanie
Hive Pig
deklaratywny proceduralny
tabele tymczasowe pipeline
polegamy na optymalizatorze
bardziej ingerujemy w
implementacje
UDF, Transform UDF, streaming
sterowniki sql data pipeline splits
72
79. Chcesz wiedzieć więcej?
Szkolenia pozwalają na indywidualną pracę z
każdym uczestnikiem
• pracujemy w grupach 4-8 osobowych
• program może być dostosowany do oczekiwań
grupy
• rozwiązujemy i odpowiadamy na indywidualne
pytania uczestników
• mamy dużo więcej czasu :)
80. Szkolenie dedykowane dla Ciebie
Interesuje Cię tematyka warsztatu?
Zapoznaj się z programami szkoleń:
Projektowanie rozwiązań Big Data z wykorzystaniem Apache
Hadoop & Family
Analiza danych tekstowych i języka naturalnego
Przetwarzanie Big Data z użyciem Apache Spark
Podstawy uczenia maszynowego w języku Python