Large-scale text processing pipeline with Spark ML and GraphFrames
Alexey Svyatkovskiy, Kosuke Imai, Jim Pivarski
Princeton University
Outline
• Policy diffusion detection in U.S. legislature: the problem
• Solving the problem at scale
• Apache Spark
• Text processing pipeline: core modules
• Text processing workflow
• Data ingestion
• Pre-processing and feature extraction
• All-pairs similarity calculation
• Reformulating the problem as a network graph problem
• Interactive analysis
• All-pairs similarity join
• Candidate selection: clustering, hashing
• Policy diffusion detection modes
• Interactive analysis with the Histogrammar tool
• Conclusion and next steps
Policy diffusion detection: the problem
• Policy diffusion detection is a problem from a wider class of fundamental text mining problems of finding similar items
• It occurs when government decisions in a given jurisdiction are systematically influenced by prior policy choices made in other jurisdictions, in a different state in a different year
• Example: "Stand your ground" bills first introduced in Florida, Michigan, and South Carolina in 2005
  – A number of states passed a form of SYG bills in 2012 after T. Martin's death
• We focus on a type of policy diffusion that can be detected by examining the similarity of bill texts
[Map: states that have passed SYG laws; states that have passed SYG laws since T. Martin's death; states that have proposed SYG laws after T. Martin's death. Source: LCAV.org]
Anatomy of Spark applications
• Spark uses a master/worker architecture with central coordinators (drivers) and many distributed workers (executors)
• We choose Scala for our implementation (both driver and executors) because, unlike Python and R, it is statically typed, and the cost of JVM communication is minimal
[Diagram: a driver program holding the SparkContext talks to the cluster manager, which launches executors (each with a cache, running tasks) on worker nodes/containers]
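For orientation, a minimal driver-side entry point might look like the sketch below; this is illustrative only, not the authors' code, and the application name is hypothetical. The SparkSession/SparkContext lives in the driver, while the cluster manager provisions the executors:

```scala
// Minimal driver sketch (assumed structure): the master/deploy mode is
// supplied by spark-submit, so the same code runs under YARN or standalone.
import org.apache.spark.sql.SparkSession

object PolicyDiffusionApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("policy-diffusion")      // hypothetical application name
      .getOrCreate()

    // ... ingestion, feature extraction, and similarity stages go here ...

    spark.stop()
  }
}
```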
Hardware specifications
• A 10-node SGI Linux Hadoop cluster
• Intel Xeon E5-2680 v2 @ 2.80 GHz CPUs, 256 GB RAM
• All servers mounted on one rack and interconnected using a 10 Gigabit Ethernet switch
• Cloudera distribution of Hadoop configured in high-availability mode using two namenodes
• Spark applications scheduled using YARN
• Distributed file system (HDFS)
• An alternative configuration uses the SLURM resource manager, deploying Spark in standalone mode
Text processing pipeline: core modules
• Data ingestion: raw unstructured text data from a static source (HDFS) or a streaming source (Kafka or Akka input), stored as structured, compressed data in Avro format
• Distributed processing with Spark ML: pre-processing, feature extraction, dimension reduction; candidate selection via clustering and hashing
• All-pairs document similarity calculation: represent document pairs and similarities as a weighted undirected graph; perform PageRank and graph queries using the GraphFrames package
• Interactive analysis: statistical analysis and aggregation using the Histogrammar package, plotting using Bokeh and Matplotlib front-ends
Data ingestion
• Our dataset is based on the LexisNexis StateNet dataset, which contains a total of more than 7 million legislative bills from 50 US states from 1991 to 2016
• The initial dataset contains unstructured text documents sub-divided by year and state
• We use the Apache Avro serialization framework to store the data and access it efficiently in Spark applications
  – Binary JSON meta-format with the schema stored alongside the data
  – Row-oriented, good for nested schemas; supports schema evolution
Example raw bill (annotated):
• Unique identifier of the bill: used to construct predicates and filter the data before calculating joins
• Entire contents of the bill as a string: not read into memory during the candidate selection and filtering steps, thanks to the Avro schema evolution property
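As a hedged sketch of this step (the HDFS path, data-source name, and column names such as state, year, and content are assumptions, not the real schema), reading the Avro files into a Spark DataFrame and filtering on metadata before touching the text column could look like:

```scala
// Assumes the spark-avro data source is on the classpath
// (built into Spark as the "avro" format in later releases).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bill-ingestion").getOrCreate()
import spark.implicits._

val bills = spark.read
  .format("com.databricks.spark.avro")           // or simply "avro" in Spark 2.4+
  .load("hdfs:///data/legislation/bills.avro")   // hypothetical path

// Filter on the small metadata columns first, so the full bill text
// is never materialized for rows that are rejected anyway.
val fl2005 = bills.filter($"state" === "FL" && $"year" === 2005)
```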
Analysis workflow 1
Cleaning, normalization, stopword removal → tokenization, n-grams → TF-IDF → truncated SVD → K-means clustering → all-pairs similarity in each cluster
• Cleaning, normalization, stopword removal: raw text of legislative proposals contains spurious white spaces and non-alphanumeric characters, which bear no meaning for the analysis and often represent an obstacle for tokenization
• Tokenization, n-grams: prepare multisets of tokens, N consecutive words, optionally adding bag-of-words features
• TF-IDF: feature weighting to reflect the importance of a term t to a document d in corpus D; words appearing very frequently across the corpus bear little meaning
• Truncated SVD: feature vectors are so high-dimensional, or so numerous, that they cannot all fit in memory; SVD performs a semantic decomposition of the TF-IDF matrix and extracts the concepts most relevant for clustering
• K-means clustering, all-pairs similarity in each cluster: extract candidate documents, i.e. those feature vectors that we need to test for similarity
APIs: DataFrame API for pre-processing and feature extraction; RDD-based (linear algebra, BLAS) for truncated SVD; RDD-based for clustering and the all-pairs step
Analysis workflow 2
(Optional) cleaning, normalization, stopword removal → N-gram/N-shingle → MinHashing → locality sensitive hashing → all-pairs similarity in each cluster
• N-grams/N-shingles: prepare sets of tokens or shingles, N consecutive words and/or characters
• MinHashing: extract MinHash signatures, integer vectors that represent the sets and reflect their similarity
• Locality sensitive hashing: extract candidate documents; a pair of documents that hashes into the same bucket for a fraction of bands makes a candidate pair
APIs: DataFrame API
Spark ML: a quick review
• Spark ML provides a standardized API for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline or workflow
  – Similar to the scikit-learn Python library
Basic components of a Pipeline are:
• Transformer: an abstract class to apply a transformation to datasets/dataframes
  – UnaryTransformer abstract class: takes an input column, applies a transformation, and outputs the result as a new column
  – Has a transform() method
• Estimator: implements an algorithm which can be fit on a dataframe to produce a Transformer. For instance, a learning algorithm is an Estimator which is trained on a dataframe to produce a model
  – Has a fit() method
• Parameter: an API to pass parameters to Transformers and Estimators
Putting it all into a Pipeline
• Pre-processing and feature extraction steps
  – Use StopWordsRemover, RegexTokenizer, MinHashLSH, HashingTF transformers
  – IDF estimator
• Prepare custom transformers to plug into the Pipeline
  – E.g. extend UnaryTransformer and override createTransformFunc using a custom UDF
• Clustering: KMeans and LSH
• Ability to perform hyper-parameter search and cross-validation
Example ML pipeline (see backup for a full snippet)
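The original slide shows a code screenshot that does not survive the export; the following is a hedged reconstruction of the kind of pipeline described above (workflow 1). Column names, the feature-space size, and k are illustrative, not the values used in the study:

```scala
// Sketch: chaining feature extraction and clustering into one Spark ML Pipeline (Spark 2.x APIs)
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, NGram, HashingTF, IDF}

val tokenizer = new RegexTokenizer().setInputCol("content").setOutputCol("words").setPattern("\\W")
val remover   = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val ngram     = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")
val hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val kmeans    = new KMeans().setK(150).setFeaturesCol("features").setPredictionCol("cluster")

val pipeline  = new Pipeline().setStages(Array(tokenizer, remover, ngram, hashingTF, idf, kmeans))

val model     = pipeline.fit(bills)        // bills: DataFrame with a "content" column
val clustered = model.transform(bills)     // adds the "cluster" label used for candidate selection
```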
Example custom transformer
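Again as a hedged sketch (the class name and the cleaning rules are made up for illustration), a custom UnaryTransformer for the cleaning/normalization step could look like:

```scala
// A UnaryTransformer that normalizes raw bill text (lowercase, strip
// non-alphanumerics, collapse whitespace); illustrative only.
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

class TextCleaner(override val uid: String)
    extends UnaryTransformer[String, String, TextCleaner] {

  def this() = this(Identifiable.randomUID("textCleaner"))

  // The function applied to every value of the input column
  override protected def createTransformFunc: String => String =
    _.toLowerCase.replaceAll("[^a-z0-9\\s]", " ").replaceAll("\\s+", " ").trim

  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be string, got $inputType")

  override protected def outputDataType: DataType = StringType
}
```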
All-pairs similarity: overview
• Our goal is to go beyond identifying the diffusion topics ("stand your ground" bills, cyberstalking, marijuana laws) and to perform an all-pairs comparison
• Previous work in policy diffusion has been unable to make an all-pairs comparison between bills over a lengthy time period because of the computational intensity
  – A brute-force all-pairs calculation between the texts of the state bills requires a cross-join, yielding O(10^13) distinct pairs on the dataset considered
  – As a substitute, scholars studied single topic areas
• Focusing on the document vectors which are likely to be highly similar is essential for all-pairs comparison at scale
• Modern studies employ variations of nearest-neighbor search, locality sensitive hashing (LSH), as well as sampling techniques to select a subset of rows of the TF-IDF matrix based on sparsity (DIMSUM)
• Our approach utilizes clustering and hashing methods (details on the next slides)
All-pairs similarity, workflow 1: clustering
• The first approach utilizes K-means clustering to identify groups of candidate documents which are likely to belong to the same diffusion topic, reducing the number of comparisons in the all-pairs similarity join calculation
  – Distance-based clustering using the fast squared-distance calculation in Spark ML
  – Introduce a dataframe column with a cluster label and perform the all-pairs calculation within each cluster, as sketched below
  – Determine the optimum number of clusters empirically, by repeating the calculation for a range of values of k and scoring on a processing-time versus WCSSE plane
  – 150 clusters for a 3-state subset, 400 clusters for the entire dataset
  – 2-3 orders of magnitude fewer combinatorial pairs to calculate compared to the brute-force approach (the brute-force solution is too slow)
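A hedged sketch of the candidate-selection step (the column names id, cluster, and features are illustrative): restricting the self-join to documents that share a cluster label keeps the pair count manageable.

```scala
// Only documents in the same cluster are paired; a.id < b.id removes
// self-pairs and keeps each unordered pair once.
import org.apache.spark.sql.functions.col

val candidatePairs = clustered.as("a")
  .join(clustered.as("b"),
        col("a.cluster") === col("b.cluster") && col("a.id") < col("b.id"))
  .select(col("a.id").as("bill1"), col("b.id").as("bill2"),
          col("a.features").as("f1"), col("b.features").as("f2"))
// a UDF computing cosine similarity of (f1, f2) then yields the similarity score
```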
All-pairs similarity, workflow 2: LSH
• N-shingle features with relatively large N > 5, hashed and converted to sets
• A characteristic matrix instead of the TF-IDF matrix (values 0 or 1)
• Extract MinHash signatures for each column (document) using a family of hash functions h1(x), h2(x), h3(x), ..., hn(x)
  – Hash several times
  – The similarity of two signatures is the fraction of the hash functions in which they agree
• LSH: focus on pairs of signatures which are likely to be from similar documents
  – Hash columns of the signature matrix M to many buckets
  – Partition the signature matrix into bands of rows; for each band, hash each column into k buckets
  – A pair of documents that hashes into the same bucket for a fraction of bands makes a candidate pair
• Use the MinHashLSH class in Spark 2.1
[Illustration: the characteristic matrix has N-shingles as rows and M documents as columns; the signature matrix has K hash functions as rows and M documents as columns, e.g. a family of 4 hash functions]
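A hedged sketch of this step with the Spark 2.1 MinHashLSH class (the column names, the number of hash tables, and the Jaccard-distance threshold are illustrative):

```scala
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.sql.functions.col

// "features" holds sparse binary vectors of shingle indices (columns of the characteristic matrix)
val mh = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("features")
  .setOutputCol("hashes")

val mhModel = mh.fit(shingled)

// Approximate self-join: only pairs with Jaccard distance below the threshold
// (i.e. similarity above 1 - threshold) are returned as candidate pairs.
val candidatePairs = mhModel
  .approxSimilarityJoin(shingled, shingled, 0.6, "jaccardDist")
  .filter(col("datasetA.id") < col("datasetB.id"))   // drop self-pairs and duplicates
```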
Similarity measures
• The Jaccard similarity between two sets is the size of their intersection divided by the size of their union:
  J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|)
  – Key distance: consider feature vectors as sets of indices
  – Hash distance: consider dense vector representations
• Cosine distance between feature vectors:
  C(A, B) = (A · B) / (|A| |B|)
  – Convert distances to similarities assuming inverse proportionality, rescaling to the [0, 100] range and adding a regularization term
Feature types and similarity measures:
• Workflow 1 (K-means clustering): unigrams, TF-IDF, truncated SVD; cosine and Euclidean measures
• Workflow 2 (LSH): N-grams with N = 5, MinHash; Jaccard measure
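As a minimal, self-contained sketch of the two measures on plain Scala collections (not the Spark vector types used in the pipeline):

```scala
// Jaccard similarity of two sets of feature indices
def jaccard(a: Set[Int], b: Set[Int]): Double =
  if (a.isEmpty && b.isEmpty) 0.0
  else a.intersect(b).size.toDouble / a.union(b).size.toDouble

// Cosine similarity of two dense feature vectors
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}
```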
Interactive analysis with the Histogrammar tool
• Histogrammar is a suite of composable aggregators with
  – A language-independent specification with implementations in Scala and Python
  – A grammar of composable aggregation routines
  – Plotting in Matplotlib and Bokeh
  – Unlike RDD.histogram, Histogrammar lets one build 2D histograms, profiles, and sum-of-squares statistics in an open-ended way
http://histogrammar.org
Contact Jim Pivarski or me to contribute!
Example interactive session with Histogrammar
GraphFrames
• GraphFrames is an extension of Spark that allows performing graph queries and graph algorithms on Spark dataframes
  – A GraphFrame is constructed from two dataframes (a dataframe of nodes and an edge dataframe), making it easy to integrate the graph processing step into the pipeline along with Spark ML
  – Graph queries: like a Cypher query on a graph database (e.g. Neo4j)
  – Graph algorithms: PageRank, Dijkstra
  – Possibility to eliminate joins
Example edge list (for similarity above threshold):
  bill1          bill2           similarity
  FL/2005/SB436  MI/2005/SB1046  91.38
  FL/2005/SB436  MI/2005/HB5143  91.29
  FL/2005/SB436  SC/2005/SB1131  82.89
[Graph: SB436, SB1046, HB5143, and SB1131 connected by edges for similarities above threshold; currently undirected, switch to directed by time]
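A hedged sketch of this step (the vertex/edge column names and the similarity threshold are illustrative): build a GraphFrame from the bill identifiers and the above-threshold pairs, then run PageRank.

```scala
import org.graphframes.GraphFrame
import org.apache.spark.sql.functions.col

// Vertices must expose an "id" column; edges need "src" and "dst".
val vertices = bills.select(col("primary_key").as("id"))     // "primary_key" is a hypothetical column
val edges    = similarities
  .filter(col("similarity") > 70.0)
  .select(col("bill1").as("src"), col("bill2").as("dst"), col("similarity"))

val graph = GraphFrame(vertices, edges)

// PageRank to rank bills by influence within the diffusion network
val ranks = graph.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.select("id", "pagerank").orderBy(col("pagerank").desc).show()
```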
Applications of the policy diffusion detection tool
• The policy diffusion detection tool can be used in a number of modes:
  – Identification of groups of diffused bills in the dataset given a diffusion topic (for instance, "Stand your ground" policy, cyberstalking, marijuana laws, ...)
  – Discovery of diffusion topics: consider top-similarity bills within each cluster; careful filtering of uniform bills and interstate compact bills is necessary, as they would show high similarity as well
  – Identification of minimum-cost paths connecting two specific legislative proposals on a graph
  – Identification of the most influential US states for policy diffusion
• A political science research paper on applications of the tool is currently in progress
Performance summary, details on the Spark configuration
• The policy diffusion analysis pipeline uses map, filter (narrow), join, and aggregateByKey (wide) transformations
  – The deterministic all-pairs calculation in approach 1 involves a two-sided join: a heavy shuffle with O(100 TB) of intermediate data
  – It benefits from increasing spark.shuffle.memoryFraction to 0.6
• Spark applications have been deployed on the Hadoop cluster with YARN
  – 40 executor containers, each using 3 executor cores and 15 GB of RAM per JVM
  – An external shuffle service inside the YARN node manager improves the stability of memory-intensive jobs with a larger number of executor containers
  – Custom partitioning to avoid straggler tasks
• Parallel efficiency is calculated as E = T1 / (N_cores · T_N), where T1 is the run time of the single-executor case and T_N is the run time on N_cores cores
Conclusions
• Evaluated the Apache Spark framework on the data-intensive machine learning problem of policy diffusion detection in US legislature
  – Provided a scalable method to calculate all-pairs similarity based on K-means clustering and MinHashLSH
  – Implemented a text processing pipeline utilizing the Apache Avro serialization framework, Spark ML, GraphFrames, and the Histogrammar suite of data aggregation primitives
  – Efficiently calculated all-pairs comparisons between legislative bills and estimated relationships between bills on a graph; the approach is potentially applicable to a wider class of fundamental text mining problems of finding similar items
• Tuned Spark internals (partitioning, shuffle) to obtain good scaling up to O(100) processing cores, yielding 80% parallel efficiency
• Utilized the Histogrammar tool as part of the framework to enable interactive analysis, allowing a researcher to perform the analysis in Scala and integrating well with the Hadoop ecosystem
Thank You.
Alexey Svyatkovskiy, email: alexeys@princeton.edu, twitter: @asvyatko
Kosuke Imai: kimai@princeton.edu
Jim Pivarski: pivarski@fnal.gov
Backup
Cluster layout:
• Service node 1: Zookeeper, journal node, primary namenode, httpfs
• Service node 2: Zookeeper, journal node, resource manager, Hive master
• Service node 3: Zookeeper, journal node, standby namenode, history server
• Datanodes: Spark, HDFS datanode service
• Login node: Spark, Hue, YP server, LDAP, ..., SSH
• A 10-node SGI Linux Hadoop cluster
• Intel Xeon E5-2680 v2 @ 2.80 GHz CPUs, 256 GB RAM
• All servers mounted on one rack and interconnected using a 10 Gigabit Ethernet switch
• Cloudera distribution of Hadoop configured in high-availability mode using two namenodes
• Spark applications scheduled using YARN
• Distributed file system (HDFS)
TF-IDF
• TF-IDF weighting: reflects the importance of a term t to a document d in corpus D
  – HashingTF transformer: maps a feature to an index by applying MurmurHash3
  – IDF estimator: down-weights components which appear frequently in the corpus
  IDF(t, D) = log[ (|D| + 1) / (DF(t, D) + 1) ]
  TFIDF(t, d, D) = TF(t, d) · IDF(t, D)
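A hedged standalone sketch of this step (column names and the feature-space size are illustrative): HashingTF produces term-frequency vectors, and IDF is an Estimator fit on them.

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF}

// Term frequencies via the hashing trick (MurmurHash3 under the hood)
val tf = new HashingTF()
  .setInputCol("ngrams").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
val tfDF = tf.transform(ngramsDF)

// IDF is an Estimator: fit() learns document frequencies, transform() applies the weighting
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tfDF)
val tfidfDF  = idfModel.transform(tfDF)
```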
Dimension reduction: truncated SVD
• SVD is applied to the TF-IDF document-feature matrix to perform semantic decomposition and extract the concepts which are most relevant for classification
• RDD-based API; a RowMatrix transposition and a matrix truncation method are implemented alongside the SVD itself
• SVD factorizes the document-feature matrix M (m × n) into three matrices U, Σ, and V, such that
  M = U Σ V^T
  with U of size m × k, Σ of size k × k, and V of size n × k, where m ≫ n ≫ k
  Here m is the number of legislative bills (on the order of millions), k is the number of concepts, and n is the number of hashed features
• The left singular matrix U is represented as a distributed row matrix, while Σ and V are sufficiently small to fit into Spark driver memory
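A hedged sketch of this step with the RDD-based API (the conversion from ML to MLlib vectors and the value of k are illustrative):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Convert the ML feature vectors to the MLlib vector type expected by RowMatrix
val rows = tfidfDF.select("features").rdd
  .map(row => OldVectors.fromML(row.getAs[Vector]("features")))

val mat = new RowMatrix(rows)

// Keep the top k concepts; U stays distributed, s and V fit in driver memory
val k   = 100
val svd = mat.computeSVD(k, computeU = true)
val u   = svd.U   // m x k distributed row matrix
val s   = svd.s   // k singular values (local)
val v   = svd.V   // n x k local matrix
```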
HistogrammarAggregator class: example
An example of a custom dataset Aggregator class using fill and container from Histogrammar
Example: hashing bands
• The signature matrix is split into b bands of r rows each; for each band, the columns are hashed into buckets
• Columns 2 and 6 would make a candidate pair; columns 6 and 7 are rather different
Example ML pipeline (simplified)
More Related Content

What's hot

Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics
Araf Karsh Hamid
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Databricks
 
Benchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkBenchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on Spark
Xiaoqian Liu
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
StreamNative
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
Prakash Chockalingam
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
Databricks
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Flink Forward
 
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Hadoop World 2011: Advanced HBase Schema Design - Lars George, ClouderaHadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Cloudera, Inc.
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
Amr Awadallah
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 

What's hot (20)

Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
 
Benchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkBenchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on Spark
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Hadoop World 2011: Advanced HBase Schema Design - Lars George, ClouderaHadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 

Similar to Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Summit East talk by Alexey Svyatkovskiy

Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Databricks
 
PEARC 17: Spark On the ARC
PEARC 17: Spark On the ARCPEARC 17: Spark On the ARC
PEARC 17: Spark On the ARC
Himanshu Bedi
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
Ian Foster
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
DataWorks Summit
 
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT Spatial approximate string search
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT Spatial approximate string searchJAVA 2013 IEEE CLOUDCOMPUTING PROJECT Spatial approximate string search
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT Spatial approximate string search
IEEEGLOBALSOFTTECHNOLOGIES
 
Spatial approximate string search
Spatial approximate string searchSpatial approximate string search
Spatial approximate string search
IEEEFINALYEARPROJECTS
 
Unit 2 Principles of Programming Languages
Unit 2 Principles of Programming LanguagesUnit 2 Principles of Programming Languages
Unit 2 Principles of Programming LanguagesVasavi College of Engg
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
Anirudh
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Herman Wu
 
Apache Spark MLlib
Apache Spark MLlib Apache Spark MLlib
Apache Spark MLlib
Zahra Eskandari
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
Subhas Kumar Ghosh
 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrier
Crai Macdonald
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
Databricks
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
DataWorks Summit
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
betalab
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
Ryan Bosshart
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
Subhas Kumar Ghosh
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 

Similar to Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Summit East talk by Alexey Svyatkovskiy (20)

Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
 
PEARC 17: Spark On the ARC
PEARC 17: Spark On the ARCPEARC 17: Spark On the ARC
PEARC 17: Spark On the ARC
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT Spatial approximate string search
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT Spatial approximate string searchJAVA 2013 IEEE CLOUDCOMPUTING PROJECT Spatial approximate string search
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT Spatial approximate string search
 
Spatial approximate string search
Spatial approximate string searchSpatial approximate string search
Spatial approximate string search
 
Unit 2 Principles of Programming Languages
Unit 2 Principles of Programming LanguagesUnit 2 Principles of Programming Languages
Unit 2 Principles of Programming Languages
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Apache Spark MLlib
Apache Spark MLlib Apache Spark MLlib
Apache Spark MLlib
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrier
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 

Recently uploaded (20)

做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 

Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Summit East talk by Alexey Svyatkovskiy

  • 6. Data ingestion • Our dataset is based on the LexisNexis StateNet dataset, which contains more than 7 million legislative bills from 50 US states from 1991 to 2016 • The initial dataset contains unstructured text documents sub-divided by year and state • We use the Apache Avro serialization framework to store the data and access it efficiently in Spark applications – binary JSON meta-format with the schema stored alongside the data – row-oriented, good for nested schemas; supports schema evolution. In the example raw bill, the unique identifier of the bill is used to construct predicates and filter the data before calculating joins, while the entire contents of the bill is kept as a string that is not read into memory during the candidate selection and filtering steps, thanks to the Avro schema evolution property. A reading sketch follows below.
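The slide illustrated this with a screenshot of a raw bill record. Below is a minimal sketch, not the authors' code, assuming hypothetical field names (primary_key, year, state, content) and the Databricks spark-avro data source of the Spark 2.x era: records are filtered on small metadata columns before the heavy text column is ever selected.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("bill-ingestion").getOrCreate()
    import spark.implicits._

    // read the Avro-serialized bills (path is illustrative)
    val bills = spark.read
      .format("com.databricks.spark.avro")
      .load("hdfs:///data/legislation/")

    // push predicates on metadata; touch the full text only for surviving documents
    val candidates = bills
      .filter($"year" >= 2005 && $"state".isin("FL", "MI", "SC"))
      .select($"primary_key", $"content")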
  • 7. Analysis workflow 1. Pipeline steps: cleaning, normalization and stopword removal; tokenization and n-grams; TF-IDF; truncated SVD; K-means clustering; all-pairs similarity within each cluster. The raw text of legislative proposals contains spurious white spaces and non-alphanumeric characters, which bear no meaning for the analysis and often represent an obstacle for tokenization. Tokenization prepares multisets of tokens: N consecutive words, optionally with added bag-of-words features. TF-IDF applies feature weighting to reflect the importance of a term t to a document d in the corpus D; words appearing very frequently across the corpus bear little meaning. The feature vectors are so high-dimensional, or so numerous, that they cannot all fit in memory; truncated SVD performs a semantic decomposition of the TF-IDF matrix and extracts the concepts most relevant for clustering. Candidate documents, i.e. the feature vectors that need to be tested for similarity, are then extracted. The tokenization and TF-IDF steps use the DataFrame API; truncated SVD and the all-pairs similarity calculation are RDD-based (linear algebra, BLAS).
  • 8. Analysis workflow 2. Pipeline steps: (optional) cleaning, normalization and stopword removal; N-gram/N-shingle extraction; MinHashing; locality sensitive hashing; all-pairs similarity among the candidates. Prepare sets of tokens or shingles: N consecutive words or characters. Extract MinHash signatures, integer vectors that represent the sets and reflect their similarity. Extract candidate documents: a pair of documents that hashes into the same bucket for a fraction of the bands makes a candidate pair. This workflow uses the DataFrame API.
  • 9. Spark ML: a quick review • Spark ML provides a standardized API for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline or workflow – similar to the scikit-learn Python library. The basic components of a Pipeline are: • Transformer: an abstract class for applying a transformation to datasets/dataframes – the UnaryTransformer abstract class takes an input column, applies a transformation, and outputs the result as a new column – has a transform() method • Estimator: implements an algorithm which can be fit on a dataframe to produce a Transformer. For instance, a learning algorithm is an Estimator which is trained on a dataframe to produce a model – has a fit() method • Parameter: an API to pass parameters to Transformers and Estimators
  • 10. Putting it all into a Pipeline • Preprocessing and feature extraction steps – use the StopWordsRemover, RegexTokenizer and HashingTF transformers, and the IDF and MinHashLSH estimators • Prepare custom transformers to plug into the Pipeline – e.g. extend UnaryTransformer and override createTransformFunc with a custom function • Clustering: KMeans and LSH • Ability to perform hyper-parameter search and cross-validation. Example ML pipeline (see the backup slide for a full snippet); a sketch of a custom transformer follows below.
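The custom transformer on this slide was shown as a screenshot. A minimal sketch in the same spirit, not the authors' class, is a simple text cleaner that lower-cases the bill text and strips non-alphanumeric characters before tokenization:

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{DataType, StringType}

    // a UnaryTransformer maps one input column to one output column
    class TextCleaner(override val uid: String)
        extends UnaryTransformer[String, String, TextCleaner] {

      def this() = this(Identifiable.randomUID("textCleaner"))

      // the per-row transformation plugged into the pipeline
      override protected def createTransformFunc: String => String =
        _.toLowerCase.replaceAll("[^a-z0-9\\s]", " ").replaceAll("\\s+", " ").trim

      override protected def outputDataType: DataType = StringType
    }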
  • 11. All-pairs similarity: overview • Our goal is to go beyond identifying known diffusion topics ("stand your ground" bills, cyberstalking, marijuana laws) and to perform an all-pairs comparison • Previous work in policy diffusion has been unable to make an all-pairs comparison between bills for a lengthy time period because of computational intensity – a brute-force all-pairs calculation between the texts of the state bills requires calculating a cross-join, yielding O(10^13) distinct pairs on the dataset considered (a cross-join of several million bills) – as a substitute, scholars studied single topic areas • Focusing on the document vectors which are likely to be highly similar is essential for all-pairs comparison at scale • Modern studies employ variations of nearest-neighbor search, locality sensitive hashing (LSH), as well as sampling techniques that select a subset of rows of the TF-IDF matrix based on sparsity (DIMSUM) • Our approach utilizes clustering and hashing methods (details on the next slides)
  • 12. All-pairs similarity, workflow 1: clustering • The first approach utilizes K-means clustering to identify groups of candidate documents which are likely to belong to the same diffusion topic, reducing the number of comparisons in the all-pairs similarity join calculation – distance-based clustering using the fast squared-distance calculation in Spark ML – introduce a dataframe column with a cluster label and perform the all-pairs calculation within each cluster – determine the optimum number of clusters empirically, by repeating the calculation for a range of values of k and scoring on a processing-time versus WCSSE plane – 150 clusters for a 3-state subset, 400 clusters for the entire dataset – 2-3 orders of magnitude fewer combinatorial pairs to calculate compared to the brute-force approach, which is too slow. A sketch of this step follows below.
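A minimal sketch of this candidate-selection step with the Spark ML KMeans estimator; the input dataframe and column names ("features", "primary_key") are assumptions, and 150 is the cluster count quoted above for the 3-state subset.

    import org.apache.spark.ml.clustering.KMeans
    import spark.implicits._   // assumes a SparkSession named `spark` in scope

    val kmeans = new KMeans()
      .setK(150)
      .setFeaturesCol("features")
      .setPredictionCol("cluster")

    val model   = kmeans.fit(vectors)           // `vectors`: dataframe of TF-IDF/SVD features
    val wcsse   = model.computeCost(vectors)    // WCSSE, used to pick k empirically
    val labeled = model.transform(vectors)      // adds the "cluster" label column

    // all-pairs comparison only within a cluster; the inequality keeps each pair once
    val pairs = labeled.as("a").join(labeled.as("b"),
      $"a.cluster" === $"b.cluster" && $"a.primary_key" < $"b.primary_key")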
  • 13. All-pairs similarity, workflow 2: LSH • N-shingle features with relatively large N > 5, hashed and converted to sets • A characteristic matrix (values 0 or 1; N-shingles as rows, documents as columns) is used instead of the TF-IDF matrix • Extract MinHash signatures for each column (document) using a family of hash functions h1(x), h2(x), h3(x), … hn(x) – hash several times – the signature matrix has one row per hash function and one column per document; the similarity of two signatures is the fraction of the hash functions in which they agree • LSH: focus on pairs of signatures which are likely to come from similar documents – hash columns of the signature matrix M into many buckets – partition the signature matrix into bands of rows and, for each band, hash each column into k buckets – a pair of documents that hashes into the same bucket for a fraction of the bands makes a candidate pair • Use the MinHashLSH class in Spark 2.1 (a usage sketch follows below)
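A minimal usage sketch of the MinHashLSH estimator available from Spark 2.1; the input dataframe `shingled` (sparse 0/1 vectors built from hashed N-shingles) and the parameter values are assumptions, not the authors' configuration.

    import org.apache.spark.ml.feature.MinHashLSH

    val mh = new MinHashLSH()
      .setNumHashTables(5)        // number of hash tables; a tuning choice
      .setInputCol("features")
      .setOutputCol("hashes")

    val mhModel = mh.fit(shingled)

    // self-join at a Jaccard-distance threshold: only bucketed candidate pairs are compared
    val candidatePairs = mhModel.approxSimilarityJoin(shingled, shingled, 0.6)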
  • 14. Similarity measures • The Jaccard similarity between two sets is the size of their intersection divided by the size of their union: J(A, B) = |A∩B| / |A∪B| = |A∩B| / (|A| + |B| − |A∩B|) – key distance: consider feature vectors as sets of indices – hash distance: consider dense vector representations • Cosine distance between feature vectors: C(A, B) = (A · B) / (|A| |B|) – convert distances to similarities assuming inverse proportionality, rescaling to the [0, 100] range and adding a regularization term • Workflow 1 (K-means clustering): feature types are unigrams, TF-IDF and truncated SVD; similarity measures are cosine and Euclidean • Workflow 2 (LSH): feature types are N-grams with N = 5 and MinHash; similarity measure is Jaccard
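For reference, the two measures written out as plain Scala functions (a sketch; the pipeline itself computes them on Spark vectors):

    // Jaccard similarity of two sets
    def jaccard[A](a: Set[A], b: Set[A]): Double =
      a.intersect(b).size.toDouble / a.union(b).size

    // cosine similarity of two dense feature vectors
    def cosine(a: Array[Double], b: Array[Double]): Double = {
      val dot   = a.zip(b).map { case (x, y) => x * y }.sum
      val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
      dot / norms
    }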
  • 15. Interactive analysis with the Histogrammar tool • Histogrammar is a suite of composable aggregators with – a language-independent specification with implementations in Scala and Python – a grammar of composable aggregation routines – plotting via Matplotlib and Bokeh – unlike RDD.histogram, Histogrammar lets one build 2D histograms, profiles and sum-of-squares statistics in an open-ended way http://histogrammar.org Contact Jim Pivarski or me to contribute! Example interactive session with Histogrammar (a sketch follows below).
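The interactive session on this slide was a screenshot. A minimal sketch, with primitive and helper names recalled from the Histogrammar Scala documentation and to be treated as illustrative: build a 1-D binning of pairwise similarity scores and fill it, locally or in one pass over an RDD.

    import org.dianahep.histogrammar._

    // empty 1-D binning of similarity scores in [0, 100]
    val simHist = Bin(20, 0.0, 100.0, {s: Double => s})

    // local fill with a few toy values
    Seq(12.3, 45.0, 87.5, 91.4).foreach(simHist.fill(_))

    // on a Spark RDD the same container can be filled in one pass with the
    // Increment/Combine helpers shipped with the Scala package, e.g.:
    // val simHist = similarities.aggregate(Bin(20, 0.0, 100.0, {s: Double => s}))(new Increment, new Combine)

The filled container can then be handed to the Bokeh or Matplotlib front-ends for plotting.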
  • 16. GraphFrames • GraphFrames is an extension of Spark that allows graph queries and graph algorithms to be run on Spark dataframes – a GraphFrame is constructed from two dataframes (a dataframe of nodes and a dataframe of edges), making it easy to integrate the graph processing step into the pipeline along with Spark ML – graph queries: like a Cypher query on a graph database (e.g. Neo4j) – graph algorithms: PageRank, Dijkstra – possibility to eliminate joins • Example edges, kept only for similarity above a threshold: FL/2005/SB436 – MI/2005/SB1046 (91.38), FL/2005/SB436 – MI/2005/HB5143 (91.29), FL/2005/SB436 – SC/2005/SB1131 (82.89) • The graph is currently undirected; it can be switched to directed by time. A construction sketch follows below.
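A minimal sketch of building the similarity graph and running PageRank with GraphFrames; the dataframes `bills` and `pairs`, their column names, and the 70.0 edge threshold are assumptions for illustration.

    import org.graphframes.GraphFrame
    import spark.implicits._   // assumes a SparkSession named `spark` in scope

    // vertices must have an "id" column; edges must have "src" and "dst"
    val vertices = bills.select($"primary_key".as("id"))
    val edges = pairs
      .filter($"similarity" > 70.0)
      .select($"bill1".as("src"), $"bill2".as("dst"), $"similarity")

    val g = GraphFrame(vertices, edges)

    // rank bills by influence within the diffusion network
    val ranked = g.pageRank.resetProbability(0.15).maxIter(10).run()
    ranked.vertices.orderBy($"pagerank".desc).show(10)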
  • 17. Applications of the policy diffusion detection tool • The policy diffusion detection tool can be used in a number of modes: – identification of groups of diffused bills in the dataset given a diffusion topic (for instance, "Stand your ground" policy, cyberstalking, marijuana laws, …) – discovery of diffusion topics: consider the top-similarity bills within each cluster; careful filtering of uniform bills and interstate compact bills is necessary as they would show high similarity as well – identification of minimum-cost paths connecting two specific legislative proposals on a graph – identification of the most influential US states for policy diffusion • The political science research paper on applications of the tool is currently in progress
  • 18. Performance summary, details on the Spark configuration • The policy diffusion analysis pipeline uses map and filter (narrow) as well as join and aggregateByKey (wide) transformations – the deterministic all-pairs calculation from approach 1 involves a two-sided join: a heavy shuffle with O(100 TB) of intermediate data – it benefits from increasing spark.shuffle.memoryFraction to 0.6 • Spark applications have been deployed on the Hadoop cluster with YARN – 40 executor containers, each using 3 executor cores and 15 GB of RAM per JVM – use the external shuffle service inside the YARN node manager to improve stability of memory-intensive jobs with a larger number of executor containers – custom partitioning to avoid straggler tasks • Calculate the efficiency of parallel execution as E = T1 / (N · TN), where T1 is the single-executor run time, TN is the run time on N executors, and N is the number of executors. A configuration sketch follows below.
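A minimal sketch of the submission-time settings described above, expressed as SparkSession configuration; the values are those quoted on the slide, and the exact configuration keys may differ between Spark releases.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("policy-diffusion")
      .config("spark.executor.instances", "40")          // 40 executor containers
      .config("spark.executor.cores", "3")               // 3 cores per executor
      .config("spark.executor.memory", "15g")            // 15 GB of RAM per JVM
      .config("spark.shuffle.service.enabled", "true")   // external shuffle service in the YARN node manager
      .config("spark.shuffle.memoryFraction", "0.6")     // legacy memory-manager knob cited on the slide
      .getOrCreate()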
  • 19. Conclusions • Evaluated the Apache Spark framework for the case of a data-intensive machine learning problem: policy diffusion detection in US legislature – provided a scalable method to calculate all-pairs similarity based on K-means clustering and MinHashLSH – implemented a text processing pipeline utilizing the Apache Avro serialization framework, Spark ML, GraphFrames, and the Histogrammar suite of data aggregation primitives – efficiently calculated the all-pairs comparison between legislative bills and estimated relationships between bills on a graph; the approach is potentially applicable to a wider class of fundamental text mining problems of finding similar items • Tuned Spark internals (partitioning, shuffle) to obtain good scaling up to O(100) processing cores, yielding 80% parallel efficiency • Utilized the Histogrammar tool as part of the framework to enable interactive analysis; it allows a researcher to perform the analysis in Scala and integrates well with the Hadoop ecosystem
  • 20. Thank you. Alexey Svyatkovskiy, email: alexeys@princeton.edu, twitter: @asvyatko; Kosuke Imai: kimai@princeton.edu; Jim Pivarski: pivarski@fnal.gov
  • 22. Cluster layout (backup) • Service node 1: Zookeeper, journal node, primary namenode, httpfs • Service node 2: Zookeeper, journal node, resource manager, Hive master • Service node 3: Zookeeper, journal node, standby namenode, history server • Datanodes: Spark, HDFS datanode service • Login node: Spark, Hue, YP server, LDAP, SSH • A 10 node SGI Linux Hadoop cluster • Intel Xeon CPU E5-2680 v2 @ 2.80GHz processors, 256 GB RAM • All the servers mounted on one rack and interconnected using a 10 Gigabit Ethernet switch • Cloudera distribution of Hadoop configured in high-availability mode using two namenodes • Schedule Spark applications using YARN • Distributed file system (HDFS)
  • 23. TF-IDF • TF-IDF weighting reflects the importance of a term t to a document d in a corpus D – the HashingTF transformer maps a feature to an index by applying MurmurHash 3 – the IDF estimator down-weights components which appear frequently in the corpus: IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), and TFIDF(t, d, D) = TF(t, d) · IDF(t, D)
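The same weighting written out as plain Scala, for reference (|D| is the number of documents and DF(t, D) the document frequency of term t):

    // smoothed inverse document frequency, as in the formula above
    def idf(docFreq: Long, numDocs: Long): Double =
      math.log((numDocs + 1.0) / (docFreq + 1.0))

    // TF-IDF weight of a term in a document
    def tfidf(tf: Double, docFreq: Long, numDocs: Long): Double =
      tf * idf(docFreq, numDocs)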
  • 24. Dimension reduction: truncated SVD • SVD is applied to the TF-IDF document-feature matrix to perform semantic decomposition and extract the concepts which are most relevant for classification • RDD-based API; RowMatrix transposition and a matrix truncation method had to be implemented along with the SVD • SVD factorizes the document-feature matrix M (m × n) into three matrices U (m × k), Σ (k × k), and V (n × k), such that M = U · Σ · V^T, with m ≫ n ≫ k. Here m is the number of legislative bills (order of 10^7), k is the number of concepts, and n is the number of features (the HashingTF dimension, a power of two) • The left singular matrix U is represented as a distributed row matrix, while Σ and V are sufficiently small to fit into Spark driver memory. A sketch of this step follows below.
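A minimal sketch of the decomposition with the RDD-based API (the transposition and truncation helpers mentioned above are not shown); the input RDD of TF-IDF row vectors and k = 50 are assumptions.

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    // one TF-IDF row vector per bill; returns (U, singular values, V)
    def truncatedSVD(tfidfVectors: RDD[Vector], k: Int = 50) = {
      val mat = new RowMatrix(tfidfVectors)
      val svd = mat.computeSVD(k, computeU = true)
      (svd.U,   // distributed m x k row matrix (one row per document)
       svd.s,   // k singular values
       svd.V)   // local n x k matrix, small enough for the driver
    }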
  • 25. HistogrammarAggregator class:  example An  example  of  a  custom  dataset  Aggregator  class  using  fill  and   container  from  Histogrammar
  • 26. Example: hashing bands into buckets. The signature matrix is partitioned into b bands of r rows each, and within each band every column is hashed into a bucket. Columns 2 and 6 (identical within a band) would make a candidate pair; columns 6 and 7 are rather different.
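The standard analysis behind this picture: with b bands of r rows each, two documents whose MinHash signatures agree in a fraction s of positions (i.e. Jaccard similarity roughly s) collide in at least one band with probability 1 − (1 − s^r)^b. A one-line Scala helper to explore that S-curve (parameter values are illustrative, not from the slides):

    // probability that a pair of documents with signature agreement s becomes a candidate
    def candidateProbability(s: Double, r: Int, b: Int): Double =
      1.0 - math.pow(1.0 - math.pow(s, r), b)

    // e.g. r = 4, b = 25: candidateProbability(0.3, 4, 25) ~ 0.18, candidateProbability(0.8, 4, 25) ~ 1.0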
  • 27. Example  ML  pipeline  (simplified)
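The original slide showed a code screenshot. Below is a minimal sketch of a simplified pipeline in the same spirit, under stated assumptions: column names, n-gram order, and the cluster count are illustrative, not the authors' exact configuration.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, NGram, HashingTF, IDF}
    import org.apache.spark.ml.clustering.KMeans

    // tokenize the raw bill text on non-word characters
    val tokenizer = new RegexTokenizer()
      .setInputCol("content").setOutputCol("tokens").setPattern("\\W+")

    // drop stop words, build n-grams, hash to term frequencies, re-weight with IDF
    val remover   = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
    val ngram     = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")
    val hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("rawFeatures")
    val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")

    // cluster the feature vectors to select candidate documents
    val kmeans = new KMeans()
      .setK(150).setFeaturesCol("features").setPredictionCol("cluster")

    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, remover, ngram, hashingTF, idf, kmeans))

    val pipelineModel = pipeline.fit(bills)        // `bills`: dataframe with a "content" column
    val clustered     = pipelineModel.transform(bills)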