Basic explanation about graph mining for social network analysis (SNA). I tried to describe some metrics and benefit from SNA (focusing on telecommunication field). Basic spark with graphx script to analyse the graph also in the slide
Like many other messaging systems, Kafka has put limit on the maximum message size. User will fail to produce a message if it is too large. This limit makes a lot of sense and people usually send to Kafka a reference link which refers to a large message stored somewhere else. However, in some scenarios, it would be good to be able to send messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such feature. This talk covers our solution to send large message through Kafka without additional storage.
My use case is to provide monitoring, and improving the overall search data quality, also to find the unusual patterns of user’s search behavior, and notifying the intent on-site back to the respective business stakeholders. To achieve the same, I explored various big data processing engines, which can process the huge data with complex business logic in real time. Eventually, I used Flink Stream processing. This talk will showcase how I used Flink to accomplish my goal.
What are Hash function and why is it used is security.
How to store passwords.
What are symmetric and asymmetric encryption function.
What is PGP program and how to use to encrypt and sign documents.
Building Event Streaming Architectures on Scylla and KafkaScyllaDB
Event streaming architectures require high-throughput, low-latency components to consistently and smoothly transfer data between heterogenous transactional and analytical systems. Join us and Confluent's Tim Berglund to learn how the Scylla and Confluent Kafka interoperate as a foundation upon which you can build enterprise-grade, event-driven applications, plus a use case from Numberly.
In this slide deck, I first describe what resilience is, what it is about, why it is important and how it is different from traditional stability approaches.
After that introductory part the main part is a "small" pattern language which is organized around isolation, the typical starting point of resilient software design. I used quotation marks for "small" as even this subset of a complete resilience pattern language still consists of around 20 patterns.
All the patterns are briefly described and for some of the patterns I added a bit of detail, but as this is a slide deck, the voice track - as usual - is missing. Also this pattern language is still sort of work in progress, i.e., it has not yet settled and some details are still missing. Yet I think (or at least hope), that the slides might contain a few useful insights for you.
Like many other messaging systems, Kafka has put limit on the maximum message size. User will fail to produce a message if it is too large. This limit makes a lot of sense and people usually send to Kafka a reference link which refers to a large message stored somewhere else. However, in some scenarios, it would be good to be able to send messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such feature. This talk covers our solution to send large message through Kafka without additional storage.
My use case is to provide monitoring, and improving the overall search data quality, also to find the unusual patterns of user’s search behavior, and notifying the intent on-site back to the respective business stakeholders. To achieve the same, I explored various big data processing engines, which can process the huge data with complex business logic in real time. Eventually, I used Flink Stream processing. This talk will showcase how I used Flink to accomplish my goal.
What are Hash function and why is it used is security.
How to store passwords.
What are symmetric and asymmetric encryption function.
What is PGP program and how to use to encrypt and sign documents.
Building Event Streaming Architectures on Scylla and KafkaScyllaDB
Event streaming architectures require high-throughput, low-latency components to consistently and smoothly transfer data between heterogenous transactional and analytical systems. Join us and Confluent's Tim Berglund to learn how the Scylla and Confluent Kafka interoperate as a foundation upon which you can build enterprise-grade, event-driven applications, plus a use case from Numberly.
In this slide deck, I first describe what resilience is, what it is about, why it is important and how it is different from traditional stability approaches.
After that introductory part the main part is a "small" pattern language which is organized around isolation, the typical starting point of resilient software design. I used quotation marks for "small" as even this subset of a complete resilience pattern language still consists of around 20 patterns.
All the patterns are briefly described and for some of the patterns I added a bit of detail, but as this is a slide deck, the voice track - as usual - is missing. Also this pattern language is still sort of work in progress, i.e., it has not yet settled and some details are still missing. Yet I think (or at least hope), that the slides might contain a few useful insights for you.
Why is My Stream Processing Job Slow? with Xavier LeauteDatabricks
The goal of most streams processing jobs is to process data and deliver insights to the business – fast. Unfortunately, sometimes our streams processing jobs fall short of this goal. Or perhaps the job used to run fine, but one day it just isn’t fast enough? In this talk, we’ll dive into the challenges of analyzing performance of real-time stream processing applications. We’ll share troubleshooting suggestions and some of our favorite tools. So next time someone asks “why is this taking so long?”, you’ll know what to do.
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking VN
- Speaker: Nguyễn Hoàng Bách - Senior Principal Engineer @ TIKI
Trải qua 9 năm xây dựng và phát triển hệ thống, đội ngũ engineer TIKI lần lượt phải giải quyết từng bài toán kỹ thuật khó khăn để hệ thống phát triển theo kịp tốc độ tăng trưởng của business. Đặc thù của hệ thống Ecommerce có một thách thức lớn là phải đảm bảo tính chính xác của dữ liệu nhưng đồng thời vẫn phải đáp ứng lượng truy cập lớn. Do đó High Concurrency Architecture có vai trò quan trọng trong kiến trúc tổng thể của TIKI. Nó cũng là bước tiến lớn của các kỹ sư TIKI trong 6 tháng qua.
Cutting-edge Hadoop clusters are bound to need custom (add-on) services that are not available in the Hadoop distribution of their choice. Agility is crucial for companies to integrate any service into existing large-scale Hadoop clusters with ease.
Apache Ambari manages the Hadoop cluster and solves this problem by extending the stack with add-on services, which can be a new Apache project, different Hadoop file system, or internal tool. This talk covers how to create a service definition in Ambari to manage lifecycle commands and configs, plus advanced topics like packaging, installing from multiple repositories, recommending and validating configs using Service Advisor, running custom commands, defining dependencies on configs and other services, and more. We will also cover how to create custom metrics and dashboards using Ambari Metric System and Grafana, generating alerts, and enabling security by authenticating with Kerberos.
Further, we will discuss the future of service definitions and how Ambari 3.0 will support custom services through Management Packs to enable Hadoop vendors to release software faster.
Speaker
Jayush Luniya, Principal Software Engineer, Hortonworks
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
At the StampedeCon 2015 Big Data Conference: Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats have different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others.
We’ve created some guidelines to help you use our brand and assets, including our logo, content and trademarks, without having to negotiate legal agreements for each use. To make any use of our marks in a way not covered by these guidelines, please contact us at brand@hashicorp.com and include a visual mockup of intended use.
Kafka Streams State Stores Being Persistentconfluent
Being Persistent: A Look Into Kafka Streams State Stores, Neil Buesing, Principal Solutions Architect, Rill Data
Meetup link: https://www.meetup.com/TwinCities-Apache-Kafka/events/284002062/
CNIT 129S: 9: Attacking Data Stores (Part 1 of 2)Sam Bowne
Slides for a college course based on "The Web Application Hacker's Handbook", 2nd Ed.
Teacher: Sam Bowne
Twitter: @sambowne
Website: https://samsclass.info/129S/129S_F16.shtml
This presentation illustrates a number of techniques to smuggle and reshape HTTP requests using features such as HTTP Pipelining that are not normally used by testers. The strange behaviour of web servers with different technologies will be reviewed using HTTP versions 1.1, 1.0, and 0.9 before HTTP v2 becomes too popular! Some of these techniques might come in handy when dealing with a dumb WAF or load balancer that blocks your attacks.
Presented @ BSides Manchester 2017 & SteelCon 2017
Application Monitoring using Open Source: VictoriaMetrics - ClickHouseVictoriaMetrics
Monitoring is the key to successful operation of any software service, but commercial solutions are complex, expensive, and slow. Let us show you how to build monitoring that is simple, cost-effective, and fast using open source stacks easily accessible to any developer.
We’ll start with the elements of monitoring systems: data ingest, query engine, visualization, and alerting. We’ll then explain and contrast two implementation approaches. The first uses VictoriaMetrics, a fast growing, high performance time series database that uses PromQL for queries. The second is based on ClickHouse, a popular real-time analytics database that speaks SQL. Fast, affordable monitoring is within reach. This webinar provides designs and working code to get you there.
The fundamentals and best practices of securing your Hadoop cluster are top of mind today. In this session, we will examine and explain the components, tools, and frameworks used in Hadoop for authentication, authorization, audit, and encryption of data and processes. See how the latest innovations can let you securely connect more data to more users within your organization.
Why is My Stream Processing Job Slow? with Xavier LeauteDatabricks
The goal of most streams processing jobs is to process data and deliver insights to the business – fast. Unfortunately, sometimes our streams processing jobs fall short of this goal. Or perhaps the job used to run fine, but one day it just isn’t fast enough? In this talk, we’ll dive into the challenges of analyzing performance of real-time stream processing applications. We’ll share troubleshooting suggestions and some of our favorite tools. So next time someone asks “why is this taking so long?”, you’ll know what to do.
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking VN
- Speaker: Nguyễn Hoàng Bách - Senior Principal Engineer @ TIKI
Trải qua 9 năm xây dựng và phát triển hệ thống, đội ngũ engineer TIKI lần lượt phải giải quyết từng bài toán kỹ thuật khó khăn để hệ thống phát triển theo kịp tốc độ tăng trưởng của business. Đặc thù của hệ thống Ecommerce có một thách thức lớn là phải đảm bảo tính chính xác của dữ liệu nhưng đồng thời vẫn phải đáp ứng lượng truy cập lớn. Do đó High Concurrency Architecture có vai trò quan trọng trong kiến trúc tổng thể của TIKI. Nó cũng là bước tiến lớn của các kỹ sư TIKI trong 6 tháng qua.
Cutting-edge Hadoop clusters are bound to need custom (add-on) services that are not available in the Hadoop distribution of their choice. Agility is crucial for companies to integrate any service into existing large-scale Hadoop clusters with ease.
Apache Ambari manages the Hadoop cluster and solves this problem by extending the stack with add-on services, which can be a new Apache project, different Hadoop file system, or internal tool. This talk covers how to create a service definition in Ambari to manage lifecycle commands and configs, plus advanced topics like packaging, installing from multiple repositories, recommending and validating configs using Service Advisor, running custom commands, defining dependencies on configs and other services, and more. We will also cover how to create custom metrics and dashboards using Ambari Metric System and Grafana, generating alerts, and enabling security by authenticating with Kerberos.
Further, we will discuss the future of service definitions and how Ambari 3.0 will support custom services through Management Packs to enable Hadoop vendors to release software faster.
Speaker
Jayush Luniya, Principal Software Engineer, Hortonworks
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
At the StampedeCon 2015 Big Data Conference: Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats have different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others.
We’ve created some guidelines to help you use our brand and assets, including our logo, content and trademarks, without having to negotiate legal agreements for each use. To make any use of our marks in a way not covered by these guidelines, please contact us at brand@hashicorp.com and include a visual mockup of intended use.
Kafka Streams State Stores Being Persistentconfluent
Being Persistent: A Look Into Kafka Streams State Stores, Neil Buesing, Principal Solutions Architect, Rill Data
Meetup link: https://www.meetup.com/TwinCities-Apache-Kafka/events/284002062/
CNIT 129S: 9: Attacking Data Stores (Part 1 of 2)Sam Bowne
Slides for a college course based on "The Web Application Hacker's Handbook", 2nd Ed.
Teacher: Sam Bowne
Twitter: @sambowne
Website: https://samsclass.info/129S/129S_F16.shtml
This presentation illustrates a number of techniques to smuggle and reshape HTTP requests using features such as HTTP Pipelining that are not normally used by testers. The strange behaviour of web servers with different technologies will be reviewed using HTTP versions 1.1, 1.0, and 0.9 before HTTP v2 becomes too popular! Some of these techniques might come in handy when dealing with a dumb WAF or load balancer that blocks your attacks.
Presented @ BSides Manchester 2017 & SteelCon 2017
Application Monitoring using Open Source: VictoriaMetrics - ClickHouseVictoriaMetrics
Monitoring is the key to successful operation of any software service, but commercial solutions are complex, expensive, and slow. Let us show you how to build monitoring that is simple, cost-effective, and fast using open source stacks easily accessible to any developer.
We’ll start with the elements of monitoring systems: data ingest, query engine, visualization, and alerting. We’ll then explain and contrast two implementation approaches. The first uses VictoriaMetrics, a fast growing, high performance time series database that uses PromQL for queries. The second is based on ClickHouse, a popular real-time analytics database that speaks SQL. Fast, affordable monitoring is within reach. This webinar provides designs and working code to get you there.
The fundamentals and best practices of securing your Hadoop cluster are top of mind today. In this session, we will examine and explain the components, tools, and frameworks used in Hadoop for authentication, authorization, audit, and encryption of data and processes. See how the latest innovations can let you securely connect more data to more users within your organization.
Data Mining In Social Networks Using K-Means Clustering Algorithmnishant24894
This topic deals with K-Means Clustering Algorithm which is used to categorize the data set into clusters depending upon their similarities like common interest or organization or colleges, etc. It categorize the data into clusters on the basis of mutual friendship.
EVOLVING PATTERNS IN BIG DATA - NEIL AVERYBig Data Week
Neil has a long history in large scale, distributed computing, ranging from architecting large scale risk-analytic systems, running startups, to data-heavy weather analysis using data-grids. He is a certified Datastax Solution Architect with Cassandra, Spark and Solr and was previously a Principle Architect at Thoughtworks where he lead several large scale projects in Retail, Logistic and other interesting fields.
Gephi is an open source software for graph and network analysis. It uses a 3D render engine to display large networks in real-time and to speed up the exploration. A flexible and multi-task architecture brings new pos- sibilities to work with complex data sets and produce valuable visual results. We present several key features of Gephi in the context of interactive exploration and interpretation of networks. It provides easy and broad access to network data and allows for spatializing, fil- tering, navigating, manipulating and clustering
Phishing as one of the most well-known cybercrime activities is a deception of online users to steal their personal or confidential information by impersonating a legitimate website. Several machine learning-based strategies have been proposed to detect phishing websites. These techniques are dependent on the features extracted from the website samples. However, few studies have actually considered efficient feature selection for detecting phishing attacks. In this work, we investigate an agreement on the definitive features which should be used in phishing detection. We apply Fuzzy Rough Set (FRS) theory as a tool to select most effective features from three benchmarked data sets. The selected features are fed into three often used classifiers for phishing detection. To evaluate the FRS feature selection in developing a generalizable phishing detection, the classifiers are trained by a separate out-of-sample data set of 14,000 website samples. The maximum F-measure gained by FRS feature selection is 95% using Random Forest classification. Also, there are 9 universal features selected by FRS over all the three data sets. The F-measure value using this universal feature set is approximately 93% which is a comparable result in contrast to the FRS performance. Since the universal feature set contains no features from third-part services, this finding implies that with no inquiry from external sources, we can gain a faster phishing detection which is also robust toward zero-day attacks.
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)paperpublications3
Abstract: The main aim of this project is secure the user login and data sharing among the social networks like Gmail, Facebook and also find anonymous user using this networks. If the original user not available in the networks, but their friends or anonymous user knows their login details means possible to misuse their chats. In this project we have to overcome the anonymous user using the network without original user knowledge. Unauthorized user using the login to chat, share images or videos etc This is the problem to be overcome in this project .That means user first register their details with one secured question and answer. Because the anonymous user can delete their chat or data In this by using the secured questions we have to recover the unauthorized user chat history or sharing details with their IP address or MAC address. So in this project they have found out a way to prevent the anonymous users misuse the original user login details.
Challenges in building a churn prediction model in different industries, presented by Jelena Pekez from Comtrade System Integration. Talk is focused on real-life use-case experience.
User Behavior Hashing for Audience ExpansionDatabricks
Learning to hash has been widely adopted as a solution to approximate nearest neighbor search for large-scale data retrieval in many applications. Applying deep architectures to learning to hash has recently gained increasing attention due to its computational efficiency and retrieval quality.
In this presentation how cloud is useful in big data analytics.It givers brief introduction to cloud service models and Big data 4V's.Here I'm describing how cloud is used in telecom and finance domain. How it is better than traditional methods.
Efficient Filtering Algorithms for Location- Aware Publish/subscribeIJSRD
Location-based services have been mostly used in many systems. preceding systems uses a pull model or user-initiated model, where a user arrival a query to a server which gives response with location-aware answers. To offer outcomes to users with fast responses, a push model or server-initiated model is flattering an important computing model in the next-generation location-based services. In the push model, subscribers arrive spatio-textual subscriptions to fastening their curiosities, and publishers send spatio-textual messages. It is used for a high-performance location-aware publish/subscribe system to send publishers’ messages to valid subscribers. In this paper, we find the exploration happenstances that start in manipulative a location-aware publish/subscribe system. We recommend an R-tree based index by merging textual descriptions into R-tree nodes. We design efficient filtering algorithms and effective pruning techniques to accomplish high performance. This method can support likewise conjunctive queries and ranking queries.
Identical Users in Different Social Media Provides Uniform Network Structure ...IJMTST Journal
The primary point of this venture is secure the client login and information sharing among the interpersonal organizations like Gmail, Face book and furthermore find unknown client utilizing this systems. On the off chance that the first client not accessible in the systems, but rather their companions or mysterious client knows their login points of interest implies conceivable to abuse their talks. In this venture we need to defeat the mysterious client utilizing the system without unique client information. Unapproved client utilizing the login to talk, share pictures or recordings and so on. This is the issue to be overcome in this venture .That implies client initially enlist their subtle elements with one secured question and reply. Since the unknown client can erase their talk or information. In this by utilizing the secured questions we need to recuperate the unapproved client talk history or imparting subtle elements to their IP address or MAC address. So in this venture they have discovered an approach to keep the mysterious clients abuse the first client login points.
Similar to Social Network Analysis with Spark (20)
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...2023240532
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
2. Definition
From the point of view of data mining, a social network is a heterogeneous and
multirelational data set represented by a graph. The graph is typically very
large, with nodes (or vertex) corresponding to objects and edges
corresponding to links representing relationships or interactions between
objects. Both nodes and links have attributes
(Han & Kamber, 2006).
2
Call, sms, IM, trf. Balance, …
mention, follow, like, …
subscriber subscriber
3. Benefit of SNA
3
Identify role of subscriber in
community:
• Community leader
• Bridge
• Passive
• Follower
Identify high value/prospect
community by looking at:
• Community size
• Closeness
• Member’s profile (device,
usage, ARPU, location)
• Onnet/Offnet share in
community
Suspected same
subscriber
Comparing two social network to
identify single identity of
subscriber. By comparing two
social network
Further
Utilization
• New product campaign, targeting community leader, bridge, and high value community
• Retention program prioritization for community leader, bridge, and high value community
• Product adoption campaign for follower in community that already adopt the product
• Identifying rotational churner to be excluded in retention campaign, or to evaluate dealer
• SN variable can be used to enhance another predictive model. For example: social network
variable can increase the lift of churn model for high value customer (Imaduddin, 2014)
4. Social Network Graph Mining
By mining the graph of social network, we can extract valuable information such
as:
• Degree (in-degree, out-degree, max-degree). Degree related to number of edge attached
to one vertex/node. Vertex with high number of in-degree means that vertex receive many
information from others, and vice versa.
• PageRank. PageRank measures the importance of each vertex in a graph. If a Twitter user
is followed by many others, the user will be ranked highly. For CDR based social network,
reverse the graph direction before use PageRank function to identify the important vertex
• Local clustering coefficient (LCC). LCC represent how close a customer’s network. The
higher the LCC, the closer the network. LCC calculation derived from triangle counting of
each vertex.
4
𝐿𝐶𝐶 =
#𝑡𝑟𝑖𝑎𝑛𝑔𝑙𝑒
𝑛
2
, 𝑛 = #𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑢𝑟
14. Solving Real World Problem
• Define the vertices. Is it subscriber, web pages, twitter account?
• Define the edge how the vertices connected. E.g. total call minutes in a month > 5
minutes, sms > 10, etc
• Identify the mega hubs. Mega hubs is vertex that connected to massive amount of vertices
(something like call center or spammer). Mega hubs can be removed, or process separately
based on the problem.
• Identify the measure needed (PageRank, degree, LCC, triangle, etc)
• Build the data source (separate the vertex properties data and the connection data – join it
later), and put it distributed on hadoop.
• Build the code, run it, and feed the result back to data warehouse or hadoop for further
utilization
14
15. References & Resources
• Han, J., & Kamber, M. (2006). Data Mining Concepts and Techniques. San Francisco: Morgan Kaufmann.
• Imaduddin, G. (2014). Evaluation and Improvement of Churn Model Using Customer Value and Social
Network. Jakarta: Universitas Indonesia.
15
• Apache Spark Overview. https://spark.apache.org/docs/latest/
• Databricks Training Resources. https://databricks.com/spark-training-resources
• GraphX Programming Guide. https://spark.apache.org/docs/latest/graphx-
programming-guide.html
• Social Network Analysis. http://en.wikipedia.org/wiki/Social_network_analysis
• Spark Scala API Doc.
https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.pac
kage
• The Scala Programming Language. http://www.scala-lang.org/