The document discusses different NoSQL data models including key-value, document, column family, and graph models. It provides examples of popular NoSQL databases that implement each model such as Redis, MongoDB, Cassandra, and Neo4j. The document argues that these NoSQL databases address limitations of relational databases in supporting modern web applications with requirements for scalability, flexibility, and high performance.
An overview of several technologies that contribute to the Big Data landscape.
An introduction to the technology challenges of Big Data, followed by key open-source components that help address various Big Data aspects such as OLAP, real-time online
analytics, and machine learning on MapReduce. I conclude by enumerating the key areas where these technologies are most likely to unlock new opportunities for various businesses.
Locality Sensitive Hashing (LSH) is a technique for finding similar items in large datasets. It works in 3 steps:
1. Shingling converts documents to sets of n-grams (sequences of tokens). This represents documents as high-dimensional vectors.
2. MinHashing maps these high-dimensional sets to short signatures or sketches in a way that preserves similarity under the Jaccard coefficient: it applies random permutations and records, for each permutation, the minimum element of the set.
3. LSH partitions the signature matrix into bands and hashes each band separately, so that similar signatures are likely to hash to the same buckets. Candidate pairs are those that share a bucket in one or more bands, reducing the number of pairwise comparisons needed.
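As a rough illustration, the three steps of the summary above can be sketched in Python. The toy documents, the salted-hash substitute for true random permutations, and the 20x5 band layout are all illustrative choices, not part of the original presentation:

```python
import hashlib
import random

def shingles(text, k=3):
    """Step 1: represent a document as the set of its k-gram shingles."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100, seed=42):
    """Step 2: a short signature of min-hash values. Each 'permutation' is
    simulated by salting a string hash, a common practical substitute."""
    rng = random.Random(seed)
    salts = [str(rng.random()) for _ in range(num_hashes)]
    return [min(int(hashlib.md5((salt + s).encode()).hexdigest(), 16)
                for s in shingle_set)
            for salt in salts]

def lsh_candidates(signatures, bands=20, rows=5):
    """Step 3: hash each band of each signature; documents sharing any
    band bucket become candidate pairs."""
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets.setdefault((b, band), []).append(doc_id)
    candidates = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely different text about web mining topics here",
}
sigs = {d: minhash_signature(shingles(t)) for d, t in docs.items()}
pairs = lsh_candidates(sigs)
```

Documents "a" and "b" share most of their shingles (Jaccard about 0.75), so their signatures agree on most positions; "c" shares none, so it should never land in the same buckets.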
MapReduce is a programming model for processing large datasets in a distributed system. It involves a map step that performs filtering and sorting, and a reduce step that performs summary operations. Hadoop is an open-source framework that supports MapReduce. It orchestrates tasks across distributed servers, manages communications and fault tolerance. Main steps include mapping of input data, shuffling of data between nodes, and reducing of shuffled data.
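The map, shuffle, and reduce steps just described can be simulated in a few lines of plain Python using word counting as the canonical example; the single-process functions below stand in for Hadoop's distributed execution:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input split.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: summarize each key's values; here, sum the counts.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts == {"big": 2, "data": 2, "is": 2, "everywhere": 1}
```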
Examples, techniques, and lessons learned building data products over the last 3 years at LinkedIn.
Pete Skomoroch is a Principal Data Scientist at LinkedIn where he leads a team focused on building data products leveraging LinkedIn's powerful identity and reputation data.
The talk describes some techniques and best practices applied to develop products like LinkedIn Skills & Endorsements.
This was the inaugural UberData Tech Talk, held in SF at Uber HQ.
Market Segmentation and Market Basket Analysis - Spotle.ai
Market segmentation involves dividing customers into subgroups where members are similar in certain characteristics and behaviors. It is useful for developing targeted marketing strategies. Common segmentation bases include demographics, behaviors, and geographic location. Methods like k-means clustering and hierarchical clustering are used. Market basket analysis examines what products customers frequently purchase together to identify cross-selling opportunities. Association rule mining and similarity measures are analytical approaches used. It provides benefits like increased sales and customer satisfaction from tailored product offerings.
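As a toy illustration of the market basket analysis described above, support and confidence for association rules can be computed directly over a list of transactions; the baskets and thresholds below are invented for the example:

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Enumerate all single-item rules X -> Y above support/confidence thresholds.
items = set().union(*transactions)
rules = []
for a, b in combinations(sorted(items), 2):
    for x, y in (({a}, {b}), ({b}, {a})):
        if support(x | y) >= 0.4 and confidence(x, y) >= 0.6:
            rules.append((tuple(x)[0], tuple(y)[0], confidence(x, y)))
```

With these five baskets, every item pair co-occurs in three transactions (support 0.6) and each rule has confidence 0.75, so all six directed rules qualify.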
In this talk, we provide an introduction to Python Luigi via real-life case studies, showing how you can break a large, multi-step data processing task into a graph of smaller sub-tasks that are aware of the state of their interdependencies.
Growth Intelligence tracks the performance and activity of all the companies in the UK economy using their data ‘footprint’. This involves tracking numerous unstructured data points from multiple sources in a variety of formats and transforming them into a standardised feature set we can use for building predictive models for our clients.
In the past, this data was collected in a somewhat haphazard fashion, combining manual effort, ad hoc scripting, and processing that was difficult to maintain. To streamline the data flows, we're using an open-source Python framework from Spotify called Luigi. Luigi was created for managing task dependencies, monitoring the progress of the data pipeline, and providing frameworks for common batch processing tasks.
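Luigi expresses a pipeline as task classes whose requires() method declares dependencies, so the framework runs prerequisites first. The dependency-free sketch below mimics that shape in plain Python (the task names and data are hypothetical; the real framework uses luigi.Task subclasses with requires(), output(), and run()):

```python
class Task:
    """Minimal stand-in for a Luigi-style task: dependencies are declared
    via requires(); a runner executes each task once, dependencies first."""
    def requires(self):
        return []
    def run(self, dep_results):
        raise NotImplementedError

def build(task, results=None):
    # Depth-first: run prerequisites before the task itself,
    # memoizing by task class name so each task runs once.
    results = {} if results is None else results
    name = type(task).__name__
    if name not in results:
        dep_values = [build(dep, results) for dep in task.requires()]
        results[name] = task.run(dep_values)
    return results[name]

class Extract(Task):
    """Upstream task: produce raw company records."""
    def run(self, dep_results):
        return ["acme,120", "globex,45"]

class BuildFeatures(Task):
    """Downstream task: turn raw records into a feature (employees > 100)."""
    def requires(self):
        return [Extract()]
    def run(self, dep_results):
        (raw,) = dep_results
        return {line.split(",")[0]: int(line.split(",")[1]) > 100
                for line in raw}

features = build(BuildFeatures())
# features == {"acme": True, "globex": False}
```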
Business Intelligence Presentation - Data Mining (2/2) - Bernardo Najlis
In this second part of the Business Intelligence Presentation, we dive into Data Mining, what it is, its business applications and some CRM related examples.
This document discusses version stamps, which are fields that change each time underlying data is updated. They are used to check for data changes in NoSQL databases that lack transactions. Various methods for creating version stamps are described, including counters, GUIDs, hashes, and timestamps. The best approach may be a composite stamp using multiple methods to leverage their advantages and avoid single points of failure. Version stamps can enable consistency when reading and updating data.
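A composite version stamp of the kind described might combine a counter, a timestamp, and a content hash; the sketch below (the Record class and payloads are invented for illustration) shows how a stale stamp blocks a conflicting update:

```python
import hashlib
import time

def make_stamp(counter, payload):
    """Composite version stamp: monotonic counter + wall-clock time +
    content hash, so no single method is a point of failure."""
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return (counter, time.time(), digest)

class Record:
    def __init__(self, payload):
        self.counter = 0
        self.payload = payload
        self.stamp = make_stamp(self.counter, payload)

    def update(self, expected_stamp, new_payload):
        # Conditional update: succeed only if the caller read the latest version.
        if expected_stamp != self.stamp:
            return False          # someone else changed the record meanwhile
        self.counter += 1
        self.payload = new_payload
        self.stamp = make_stamp(self.counter, new_payload)
        return True

rec = Record("balance=100")
s = rec.stamp
assert rec.update(s, "balance=90")      # first writer wins
assert not rec.update(s, "balance=80")  # stale stamp is rejected
```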
Slides for the impulse talk "Data Strategy & Governance" - BI or DIE LEVEL UP 2022
Recording of the talk: https://www.youtube.com/watch?v=705DfyfF5-M
Today's web architecture, the need for scalability in relational databases, and an introduction to NoSQL databases and their different types. The presentation video is available at: http://youtu.be/oIpjcqHyx2M
Presentation delivered during the Introductory Course to Big Data in Agriculture. 29/11/2013, NCSR Demokritos, Athens, Greece.
The presentation is heavily based on the report titled "Big Data Now: 2012 Edition" by O'Reilly Media, Inc.
More info about the event: http://wiki.agroknow.gr/agroknow/index.php/Athens_Green_Hackathon_2013
The Persona-Based Value of Modern Data Governance - Precisely
Yes, data governance solutions are now a business imperative. But modern demands require integrated capabilities to discover, understand, profile, and measure data integrity across the many different functions of your organization.
This presentation shares four persona-based use cases and demos to illustrate how a single, modular, and interoperable solution can optimize collaboration and empower your data teams to deliver data-driven decisions faster and more confidently.
Are you ready for the future of data governance? Check out what will be required:
• Understand data relationships to business objectives and metrics, and request new actions
• Discover new data element alerts to profile and add contextual details to your analysis
• Review needed data quality rules, lineage, and impact, and proactively monitor data changes over time
• Access and respond to data replication requests for more timely results
• Create data quality pipelines and enrich data for more insightful analytics
In business, master data management is a method used to define and manage the critical data of an organization to provide, with data integration, a single point of reference.
1. The document discusses rational decision making and business intelligence. It defines rational decision making as selecting the optimal alternative based on analyzing past data and considering various performance criteria.
2. It describes the typical cycle of a business intelligence analysis as involving defining objectives, generating insights from data analysis, making decisions based on insights, and evaluating performance.
3. Key components of business intelligence architectures are data sources, data warehouses/marts for storing and processing data, and business intelligence tools for generating insights and supporting decision making.
What are data products and why are they different from other products? - inovex GmbH
This document discusses data products and how they differ from traditional products. It defines three types of data products: data as a service, data-enhanced products, and data as insights. The document highlights that data product management may require different processes and methods than traditional product management. It provides examples from interviews with companies about how they approach aspects like defining value propositions, positioning, and identifying data for data products. The conclusion is that data products are different and require categorizing them to help with business modeling and product definition.
Data architecture defines the target state for an information system by describing how data is processed, stored, and utilized. It shows data structures, flows, and usage across business applications and systems. Data architecture sets data standards and addresses both stored and moving data. Its benefits include higher quality, reduced costs, quicker time to market, clearer scope, faster performance, better documentation, fewer errors, and managed risks. Defining the target state involves conceptual, logical, and physical architectural processes to represent enterprise entities, their relationships, and specific data mechanisms. Influencers include requirements, technology, economics, policies, and processing needs. Principles include building decoupled systems, using the right tools, leveraging managed services, and using log-
This document provides an overview and introduction to MongoDB. It discusses how new types of applications, data, volumes, development methods and architectures necessitated new database technologies like NoSQL. It then defines MongoDB and describes its features, including using documents to store data, dynamic schemas, querying capabilities, indexing, auto-sharding for scalability, replication for availability, and using memory for performance. Use cases are presented for companies like Foursquare and Craigslist that have migrated large volumes of data and traffic to MongoDB to gain benefits like flexibility, scalability, availability and ease of use over traditional relational database systems.
Precision Match is India's largest video supply consortium, providing over 50 million video views per month across APAC. It has a library of over 10,000 hours of original video content from over 10 content providers across genres like entertainment, news, sports, women, and business/finance. Precision Match offers programmatic video advertising capabilities and the ability to disseminate brand messages across devices to targeted audiences.
Anahita Khorrami Banaraki has extensive education and experience in cognitive neuroscience, including a PhD in cognitive neuroscience. Her research interests include social cognition, neuromarketing, face and object processing in autism, and she has experience conducting research using EEG, ERP, eye tracking, TMS, and neuropsychological assessments. She has held various teaching and research positions focused on cognitive neuroscience and psychology.
This short document promotes creating presentations using Haiku Deck, a tool for making slideshows. It encourages the reader to get started making their own Haiku Deck presentation and sharing it on SlideShare. In just one sentence, it pitches the idea of using Haiku Deck to easily create engaging slideshows.
The document provides notes on group theory. It discusses the definition of groups and examples of groups such as (Z, +), (Q, ×), and Sn. Properties of groups like Lagrange's theorem and criteria for subgroups are also covered. The notes then discuss symmetry groups, defining isometries of R2 and showing that the set of isometries forms a group. Symmetry groups G(Π) of objects Π in R2 are introduced and shown to be subgroups. Specific examples of symmetry groups like those of triangles, squares, regular n-gons, and infinite strips are analyzed. Finally, the concept of group isomorphism is defined and examples are given to illustrate isomorphic groups.
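The notion of a symmetry group can be made concrete by generating the symmetries of the square from one rotation and one reflection, represented as permutations of the corner labels 0-3. This small sketch (not from the notes themselves) also illustrates non-commutativity and a special case of Lagrange's theorem:

```python
def compose(p, q):
    """Compose two permutations given as tuples: (p o q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(q)))

def generate(generators):
    """Close a set of permutations under composition, yielding a finite group."""
    identity = tuple(range(len(generators[0])))
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for h in generators:
            gh = compose(g, h)
            if gh not in group:
                group.add(gh)
                frontier.append(gh)
    return group

r = (1, 2, 3, 0)   # rotate the square's corners 0-3 by 90 degrees
s = (3, 2, 1, 0)   # a reflection (reverses the cyclic order of the corners)

d4 = generate([r, s])        # the full symmetry group of the square
rotations = generate([r])    # the cyclic subgroup of rotations

# D4 has 8 elements; the rotation subgroup has 4, and 4 divides 8 (Lagrange).
# The group is non-abelian: r o s != s o r.
```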
Content Marketing is so broad. So many options. So many techniques. So many tools.
Feeling overwhelmed? This is the perfect presentation for beginners (and more advanced experts) to make sure you are covering the basics.
Green Belt Six Sigma training was completed from 2011 to 2012, with additional training and certification obtained from 2012 to 2013, culminating in Green Belt certification and continued development of Six Sigma skills over this period.
Search engine optimization (SEO) refers to techniques that help a website rank higher in search engines. SEO has two main components: on-page optimization, which involves changes to the website itself, and off-page optimization involving external work. Keywords are important to target for SEO and can be broad, general terms or specific, targeted phrases. Internal linking structure, including page depth, number of links to a page, and link quality, is also crucial for a website's SEO performance. The title tag, keyword targeting, and link structure are important on-page elements to optimize.
Personal Website Application Example: myWeb - veliakcakaya
A personal website application example for universities. A useful presentation from Akademik Bilişim 2009 on how a personal website service can be offered at a university.
Lecture notes for the Web Design course offered in the Department of Computer Education and Instructional Technology, Faculty of Education, Eskişehir Osmangazi University. Steps for developing a website.
Google Apps is a Google service that brings together the core business applications that support your work. Since all applications and data reside in the cloud on the internet, your employees can work securely from anywhere with an internet connection, on any computer or mobile device.
3. INTRODUCTION
• According to many engineers, Web Mining was first proposed by Etzioni in 1996 [1].
• Web Mining is a subfield of Data Mining.
• Starting from Data Mining, subfields such as Web Mining and Text Mining have emerged.
4. INTRODUCTION
• Every company, public institution, and similar organization that grasps the importance of data collection stores data and draws conclusions from it.
• However, the information obtained from databases through queries is only query-based, and the highest level of benefit cannot be reached this way. The way to get the most out of data is data mining.
5. Data, Text, and Web Mining
• Data Mining divides into subfields depending on the context.
• Data mining is the general name for the techniques used to extract meaningful information and relationships from existing data. While data mining can analyze structured data, text and web mining are used to convert unstructured data into a structured form for use in data mining.
6. Data, Text, and Web Mining
• Text and Web Mining, subfields of data mining applicable in many different areas, begin with converting unstructured data into structured form using text and web mining methods, and continue with technical processing.
• Above all, however, unstructured data must first become structured data that can be used in data, web, or text mining.
7. Structured and Unstructured Data
• Structured data is a term used for data that can be organized within a structure and can therefore be defined. Structured data can be organized and searched according to the data type of its content.
• The most widely used structured data sources are SQL (Structured Query Language) and Access data sources. For SQL sources, database systems such as Oracle, PostgreSQL, and Microsoft SQL Server can be used.
8. Structured and Unstructured Data
• Unstructured data, by contrast, has no definable structure.
• The best-known unstructured data types are image files; text documents such as PDF, Word, and plain text files; log files kept on the web; and e-mails. Cell-based data such as Excel, although structured, still occupies a debated position between the structured and unstructured camps.
9. The Role of Data Mining in Text and Web Mining
• Before data mining solutions and algorithms can find patterns or build models in text or web data, that data must be structured.
• Text and web mining operations can be described as the tools used to obtain the structured data that data mining will work on.
10. Text and Web Mining
• Text and web mining are closely related areas that have been studied intensively in recent years. Text mining is the analysis of very large document collections and the extraction of hidden patterns within text-based data.
• Web mining covers the analysis of web-related data, including web contents, page structures, and web link statistics [10].
11. Text Mining
• Briefly, text mining is the method used to uncover the meaning in text data.
• Because there are no standard rules for how text is written, computers cannot understand it directly.
12. Text Mining
• The traditional methods used to extract content from unstructured information are non-linguistic methods.
• These methods rest on comparing the characters of the words in the query with those in the text, and therefore cannot produce results that describe the content.
13. Text Mining
• Understanding language rests fundamentally on linguistic approaches, referred to as Natural Language Processing (NLP).
• A system incorporating NLP can intelligently extract expressions with complex structure (for example, the difference between the cold water running from a shower and cold drinking water) and classify terms, assigning them to classes such as products, organizations, or people.
14. Introduction to Web Mining
• After all these summaries, we now turn to our main topic, Web Mining.
• As noted at the beginning, understanding Web Mining requires understanding Data Mining and having at least surface-level knowledge of Text Mining.
15. Web Mining
• Web usage mining is a type of data mining activity in which user access patterns are automatically discovered and analyzed from one or more web servers.
• Many organizations carry out the strategies they develop for market analysis based on visitor information. Through daily operations, organizations collect hundreds of megabytes of data every day.
16. (image-only slide; no text captured)
17. Web Mining
• Most of this information is obtained from the log files that web servers keep automatically. A log file grows as a record is appended for every request sent from a client to the server.
• Analyzing log files helps organizations in their decision processes on matters such as customers' areas of interest, building market strategies around products, and the impact of promotional campaigns.
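As a sketch of the first step of web usage mining, server log lines in the Common Log Format can be parsed into per-page hit counts and per-visitor page sequences; the log lines below are fabricated for illustration:

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status bytes
LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) \d+')

log_lines = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /products HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:56:02 +0000] "GET /cart HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Oct/2023:13:57:11 +0000] "GET /products HTTP/1.1" 200 2326',
]

page_hits = Counter()   # how often each page was requested
sessions = {}           # the sequence of pages each host visited

for line in log_lines:
    m = LOG_RE.match(line)
    if not m:
        continue                 # skip malformed records
    host, ts, method, path, status = m.groups()
    page_hits[path] += 1
    sessions.setdefault(host, []).append(path)
```

From aggregates like these, the analyses the slides describe (interest areas, campaign impact, navigation patterns) can be built up.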
18. Web Mining
• Analyzing server access logs and user registration data also provides valuable information on how the website can be improved so that the organization presents itself more effectively.
• In organizations using intranet technologies, such analyses can shed light on workgroup communication and on running the corporate infrastructure more effectively.
19. Web Mining
• Finally, for organizations that advertise on the World Wide Web, analyzing user access patterns helps direct advertisements to specific user groups.
• A diagram of the areas of web mining and the stages of web usage mining is given on the next slide.
21. Web Mining branches (diagram):
• Web Content Mining: tries to find useful information in accessible web resources.
• Web Structure Mining: tries to produce a structural summary of websites and their pages.
• Web Usage Mining: tries to find meaningful and useful patterns in the activity data generated during user accesses.
22. 1. Web İçerik Madenciliği
• Web içerik madenciliği ile web sayfalarının
içerikleri incelenir ve kullanışlı bilgi çıkarımı
sağlanır.
• Web içerik madenciliği kullanarak web
sayfalarının başlıklar, içerisinde geçen
kelimeler, resimler veya müzik dosyalar
incelenir. Bulunan içeriklere göre web siteleri
belirli sınıflara veya kümelere ayrılabilir
23. 1. Web İçerik Madenciliği
• Web içerik madenciliği web kaynaklarından
otomatik bilgi arama tekniklerini tanımlar. Verinin
farklı tiplerde oluşu ve yapısal olmayışı bu
konudaki tekniklere daha karışık yaklaşımlar
kazandırır.
• İki tip veri madenciliği stratejisi olabilir; metin
içeriklerini doğrudan arama ya da arama
motorları gibi araçların aramalarını yardımcı alan.
24. 2. Web Yapı Madenciliği
• Web erişim araçlarının çoğu çok değerli olabilecek
bağlantı(link) verisini gözardı ederek sadece text
verisine ulaşır, Web yapı madenciliğinin amacı
web sitesi ve web sayfası hakkında bağlantı
verisine bakarak bilgi üretmektir.
• Teknik olarak, Web içerik madenciliği dökümanın
içeriğine, yapı madenciliği ise dökümanlar arası
bağlantılara yoğunlaşır
25. 2. Web Structure Mining
• In other words, web structure mining examines the relationships between the web sites and web pages that form the basic structure of the Internet, and between the links within a web page.
26. 3. Web Usage Mining
• Web usage mining examines the user access records kept on web servers in order to find meaningful and useful patterns. By applying web usage mining methods, the behavior and attitudes of the people who visit a web site can be determined.
27. 3. Web Usage Mining
• Web usage mining aims to produce knowledge from the data generated by the access actions users perform while browsing the Web.
• Work in this area can be grouped under the headings General Web Usage Mining, Site Modification Systems, System Improvement, and Personalization.
28. 3. Web Usage Mining
1. General Web Usage Mining systems try to find users' general behavior patterns by applying known or newly proposed data mining algorithms to the data in server access files.
2. The goal of Site Modification Systems is to find the changes that should be made to a site's content and structure.
3. Research on System Improvement aims to make traffic more efficient by using web usage data.
4. Finally, personalization work tries to build sites that adapt to individual demands.
29. Pattern Discovery Techniques
• Every web mining task needs a pattern discovery technique adapted from one of several research fields.
31. Pattern Discovery Techniques
• Descriptive Statistics: Statistical methods are among the most powerful techniques for describing the data on a web site and extracting information from it. An analyst can run descriptive statistical analyses based on different variables.
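As a minimal sketch of descriptive statistics on usage data, the snippet below summarizes per-session page counts with Python's standard library. The session lengths are invented for illustration; in practice they would be extracted from the server access log.

```python
import statistics

# Hypothetical page counts per session, as might be extracted from a log.
session_lengths = [3, 5, 2, 8, 5, 4, 7, 5]

summary = {
    "sessions": len(session_lengths),
    "mean": statistics.mean(session_lengths),      # average pages per session
    "median": statistics.median(session_lengths),  # typical session length
    "stdev": statistics.pstdev(session_lengths),   # spread across sessions
}
print(summary)
```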
32. Pattern Discovery Techniques
• Association Rules: Pages that are used together on a web site can be found by applying association rules and then placed on the same server. Association rules generally try to detect relationships among the records in a database.
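The "pages used together" idea above can be sketched as a minimal association-rule miner over browsing sessions. This is an illustrative sketch, not a full Apriori implementation: it only considers page pairs, and the session data, page names, and thresholds are invented.

```python
from itertools import combinations
from collections import Counter

def page_pair_rules(sessions, min_support=0.3, min_confidence=0.6):
    """Find 'page X -> page Y' rules from browsing sessions.

    sessions: list of sets of pages visited in one session.
    support    = fraction of sessions containing both pages
    confidence = support(X and Y) / support(X)
    """
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for pages in sessions:
        page_counts.update(pages)
        pair_counts.update(combinations(sorted(pages), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):       # try the rule in both directions
            confidence = count / page_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules

sessions = [
    {"/index", "/courses", "/contact"},
    {"/index", "/courses"},
    {"/index", "/about"},
    {"/courses", "/notes"},
]
rules = page_pair_rules(sessions)
for x, y, s, c in rules:
    print(f"{x} -> {y}  support={s:.2f} confidence={c:.2f}")
```

On this toy data only the pair (/index, /courses) is frequent enough, so the rule appears in both directions.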
33. Pattern Discovery Techniques
• Clustering: Cluster analysis aims to form groups by bringing together records that carry similar characteristic values.
• Classification: These techniques try to place records into the predefined classes they belong to.
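Clustering visitors by their page-visit behavior can be sketched with a minimal k-means over per-user visit-count vectors. The pages, counts, and choice of k are invented for illustration; a real analysis would use a library implementation and far more data.

```python
import random

def kmeans(points, k=2, iters=20, seed=0):
    """Minimal k-means over numeric vectors (e.g. per-user page-visit counts)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Hypothetical visit counts per user for the pages (/index, /courses, /notes).
visits = [(9, 1, 0), (8, 2, 1), (1, 7, 9), (0, 8, 8)]
centroids, clusters = kmeans(visits)
print(clusters)
```

On this data the two "/index-heavy" users end up in one cluster and the two "/courses- and /notes-heavy" users in the other.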
34. Pattern Discovery Techniques
• Sequential Patterns: Similar patterns are sought across data sets that extend over time.
• Dependency Modeling: The goal is to build models that reveal the dependencies among Web variables.
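A minimal sketch of sequential-pattern counting: for each session, record every ordered page pair (X seen before Y), then keep the pairs that recur across sessions. The sessions and threshold are invented for illustration, and real sequential-pattern algorithms (e.g. GSP or PrefixSpan) handle longer subsequences.

```python
from collections import Counter

def frequent_ordered_pairs(sessions, min_count=2):
    """Count ordered page pairs (X visited before Y in the same session)."""
    counts = Counter()
    for seq in sessions:
        seen = set()
        pairs = set()          # count each pair at most once per session
        for page in seq:
            for prev in seen:
                pairs.add((prev, page))
            seen.add(page)
        counts.update(pairs)
    return {pair: c for pair, c in counts.items() if c >= min_count}

sessions = [["/index", "/courses", "/notes"],
            ["/index", "/notes"],
            ["/courses", "/index"]]
frequent = frequent_ordered_pairs(sessions)
print(frequent)
```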
35. CONCLUSION
• The most important reason web mining is used in so many areas today is that it creates a customer-oriented system: by integrating the behavior people exhibit on web pages, their actions, and the records of the transactions they perform into existing business processes, it makes it possible to understand the customer as well as possible.
36. Example
2002-01-06 13:45:24 65.116.145.138 - 193.255.141.93 80 GET
/dersler/grafik/Notes/default.html - 200
Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;+DigExt)
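The log line above can be parsed into named fields as a first step of web usage mining. The sketch below assumes the common IIS W3C extended layout (date, time, client IP, username, server IP, port, method, URI, query, status, user agent); the field order is an assumption inferred from the example, and in this format spaces inside the user agent are encoded as '+', so a plain whitespace split works.

```python
def parse_log_line(line):
    """Parse one W3C extended log line (IIS-style) into a dict."""
    fields = ["date", "time", "client_ip", "username", "server_ip",
              "port", "method", "uri", "query", "status", "user_agent"]
    record = dict(zip(fields, line.split()))
    record["status"] = int(record["status"])
    # '+' stands in for spaces inside the user-agent string.
    record["user_agent"] = record["user_agent"].replace("+", " ")
    return record

line = ("2002-01-06 13:45:24 65.116.145.138 - 193.255.141.93 80 GET "
        "/dersler/grafik/Notes/default.html - 200 "
        "Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;+DigExt)")
record = parse_log_line(line)
print(record["client_ip"], record["uri"], record["status"])
```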
Editor's Notes
To understand what Web Mining is, we must first know what Data Mining is, because Web Mining is a subfield of Data Mining.
Briefly, what is data mining?
And what is structured data?
The relationship between data mining and web and text mining