Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Clifford Chance

Mirko Bernardoni & Michael Seddon
Clifford Chance
#UnifiedDataAnalytics #SparkAISummit

About Mirko Bernardoni
MIRKO BERNARDONI
Head of Data Science
Clifford Chance
https://www.linkedin.com/i/mirko-bernardoni
Mirko’s main role is to build and lead the data science lab.
He strongly believes that the work that Clifford Chance is doing in the
research department offers a unique opportunity to shape and change the
legal sector, which is why he also works closely with universities on research.

About Michael Seddon
MICHAEL SEDDON
Senior Machine Learning Engineer
Clifford Chance
https://www.linkedin.com/in/mrmichaelseddon/
Michael’s focus is on solving Clifford Chance’s data engineering challenges.
As a core member of the team, he has been involved in the design and development of the
data pipelines crucial to the lab’s success, as well as delivering numerous machine learning
projects helping to reshape Clifford Chance’s operational insights and execution.

We are one of the world’s
pre-eminent law firms with
significant depth and range of
resources across five continents
Clifford Chance
As a global law firm we are able to support
clients at both a local and international level
across Europe, Asia Pacific, the Americas,
the Middle East and Africa.
Our global, cross-discipline teams advise on a
full range of legal solutions. We have a global
view, and through our sector approach, a
detailed understanding of our clients’ business,
its drivers and competitive landscapes.
Our Responsible Business strategy is integral
to our firm strategy. It guides how we conduct
our core business, how we develop and
support our people, and how we foster closer
collaboration with our clients.
• Doing business – we promote market-
shaping practices in relation to ethics,
professional standards and
risk management
• People – we realise the potential of our
people by creating a safe, healthy and
inclusive workplace, and broadening our
skills and experience
• Community – we partner to support our
community by widening access to
justice, finance and education
• Environment – we manage our
environmental footprint and contribute to
developing a more sustainable world.
At Clifford Chance, we are committed to
delivering a world-class service – providing
the highest quality advice and support
efficiently and effectively, every time.
Our clients, who include corporate
companies across all commercial and
industrial sectors, governments, regulators,
trade bodies and not-for-profit organisations,
are at the heart of how we work.
Understanding what our clients value and
aligning with their needs underpins our
approach. We invest heavily to ensure that
clients benefit from our formidable knowledge
and market insights, that they have access to
the best team for the job, and that we bring
the right processes and advanced
technologies to bear on each matter.
Trusted legal advice
for the world’s
leading businesses of
today and tomorrow
As a single, fully integrated, global partnership, we pride ourselves on our
approachable, collegiate and team-based way of working.
We are a single profit pool, lockstep
partnership. Our ambition is to work
collaboratively across geographies, practices,
product areas and sectors to deliver the best
advice and support to our clients.
Our structure
Our international network
Responsible Business
Delivering value to our clients

Agenda
Data Science in Law
Architecture
First success: Graph Analysis
NLP deep learning in Spark & Databricks

Data science in Law
Why would a law firm get involved in creating a data science lab?

Digital is changing how business gets done
Current Law Firm + Digital Technologies = Future Law Firm
• Agile platforms and solutions designed to change and adapt
• Human augmentation (AI)
• Draw better insight out of data; convert intelligence into action
• Simplify & digitise work execution
Digital Trends
• Re-envision existing and enable new business
models
• Embrace different ways of bringing people together
• Develop new capabilities that help organizations
transform themselves into digital organizations
New regulations, markets
• Staying ahead by anticipating what’s next
• Industry competitiveness (panels, benchmarking)

What we do in the Data Science Lab
We seek to better understand our data, diagnose, predict business outcomes
and recommend what to do to achieve our goals.
360 information products
Provide self-service 360 views of critical data, available to all,
to research particular topics further and enhance the existing
business intelligence reporting.
Operational insights, predictions & prescriptions
Provide cross-cutting operational insights, anticipate what will
happen and recommend what to do to achieve goals.
For example:
• what are the patterns in our profitability and what actions can we take?
• can we better predict our fees?
Turn our data into new Client Products & Services
Seek to create new products and services based on our data and the
insights we can derive from it.
• Up to your imagination!

AI Adoption Maturity Curve
Address Cloud challenges
Innovation pipeline
What are you trying to do?
AI capability
AI maturity
From research projects to Client Solutions
BI & Apps
• Data driven business
analytics & reporting
Azure Data Services, SQL Server
Power Platform (Power BI, Power Apps,
Power Flow)
SaaS with AI
• Immediate actionable insights
with AI
Already productised: for example for due diligence, risk
management, contract automation
(also with Office 365, Workplace Analytics)
Ad hoc AI
• AI Research
• Big data insight
• AI idea validations
Azure cloud services
Infrastructure as code
New opportunities
Speed to market
AI “Accelerators”
• Solution specific AI services
& patterns
Azure Cognitive Services
Azure Bot Services
Azure Search
Industry accelerators
AI Products
• Operationalise AI
• Product realisation
• Data Science & Deep AI
capabilities
Open Frameworks
Azure Databricks
Azure Machine Learning
Azure AI Infrastructure
Legal specific

DS-Lab end to end capability
Technology pioneers
Evangelization
DevOps / SecOps / Operationalization
Idea & business process management
Data Science
• Research
• Academic collaborations
• Modelling (e.g. ML and DL)
Data pipeline
• Data extraction / collection
• Transformation
• Compliance
• Confidentiality
Production
• Minimum Viable Product
• AI productionisation
• Product development

Architecture
The Legal Industry deals with confidential matters…
is the data science lab system architecture appropriate?

Confidentiality
User Access
Time validity
Contract limitations
Client Contracts
Audits!

Line of Business
Data Science
Dataset 1
Dataset 3
Dataset 2
Batch
Stream
ML+AI Community and
topics detection
ML+AI Relationship analysis
ML+AI Decision Automation
Data Engineering
Data Processing
Customers
Volume
Variety
Velocity
Many challenges in legal
Blob storage
Data Lake
On-premise

First success: Graph analysis
What is the data science lab spark?

Understanding client relationships: a work in progress
Understanding customer
relationships is crucial to our
business and revenue
generation
We have a constant
battle getting people to
use our CRM
(customer relationship
management); it is
manually intensive, and
people's time is
pressured
Use our existing data to
uncover insights into our
customer and client
relationships
Client relationship analysis

Key performance indicator
Define datasets
Academic literature & research
Demo realization
Production
• Email system
• HR system
• CRM system
• Others..
• Knowledge graph
• Graph algorithms
• Dataset analysis
• Research implementation
• User visualisation
• Minimum viable product
• Productionise the research
• Agile Project
• Confidential
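As a rough illustration of the relationship-analysis idea on these slides, the sketch below builds a weighted graph from email metadata and ranks the strongest links. The record layout, the addresses and the use of message counts as edge weights are all assumptions made for illustration; the slides do not describe which graph algorithms were actually used.

```python
from collections import Counter

# Hypothetical email metadata: (sender, recipient) pairs only, no content.
emails = [
    ("alice@firm.example", "cfo@client-a.example"),
    ("alice@firm.example", "cfo@client-a.example"),
    ("bob@firm.example", "cfo@client-a.example"),
    ("bob@firm.example", "gc@client-b.example"),
    ("cfo@client-a.example", "alice@firm.example"),
]

def build_graph(pairs):
    """Undirected weighted graph: edge weight = number of emails exchanged."""
    edges = Counter()
    for sender, recipient in pairs:
        edges[frozenset((sender, recipient))] += 1
    return edges

def strongest_links(edges, top=3):
    """Return the `top` heaviest edges as (sorted endpoints, weight)."""
    ranked = sorted(edges.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    return [(tuple(sorted(pair)), weight) for pair, weight in ranked[:top]]

graph = build_graph(emails)
links = strongest_links(graph)
```

Counting messages in both directions on one undirected edge is a deliberate simplification; a directed graph would distinguish who initiates contact.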

NLP deep learning in Spark &
Databricks
Let’s get serious

Document classification
LIBOR, Brexit and similar
exercises deal with a huge
number of documents.
Document analysis requires
the identification of the
document type
The ability to recognise
document types is a
requirement for many
matters
We often have urgent use
cases from client tenders

Document classification
Key performance indicator
Define datasets
Academic literature & research
Demo realization
Production
• EDGAR
• Text classification
• Document classification
• Deep learning for NLP
• Work in progress
• Minimum viable product
• Productionise the research
• Agile Project
• Confidential

Document example
• Document size: 300 pages or more (more than 150,000 words)

Open EDGAR
Data download and cleaning
U.S.A
SEC Open EDGAR
Document Analysis
Document Cleaning
Blob
Storage
Text Extraction
15M raw documents
7M cleaned documents
250K state of the art
Metadata

Models development
• Cross validation with 10 datasets of 1000
documents
• Considered the following models/architectures:
– CNN
– LSTM
– Doc2Vec
– SVM
– BERT
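One way to read the evaluation set-up above is as 10-fold cross-validation over ten disjoint datasets of 1000 documents each. The sketch below assumes that reading, which the slide does not confirm; `train_and_score` is a hypothetical callback standing in for any of the listed models, and the corpus of integer document ids is made up.

```python
import random
from statistics import mean

def make_datasets(doc_ids, n_datasets=10, size=1000, seed=0):
    """Draw n_datasets disjoint random datasets of `size` documents each."""
    rng = random.Random(seed)
    sample = rng.sample(doc_ids, n_datasets * size)
    return [sample[i * size:(i + 1) * size] for i in range(n_datasets)]

def cross_validate(datasets, train_and_score):
    """Hold out each dataset once, train on the rest, return the mean score."""
    scores = []
    for i, held_out in enumerate(datasets):
        train = [d for j, ds in enumerate(datasets) if j != i for d in ds]
        scores.append(train_and_score(train, held_out))
    return mean(scores)

# Hypothetical corpus of document ids and a dummy scorer for demonstration.
datasets = make_datasets(list(range(50_000)))
dummy_score = cross_validate(datasets, lambda train, test: len(test) / len(train))
```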

Hyperparameter optimisation
• How can we choose the optimal
hyperparameters?
– Grid search
– Bayesian optimization

Grid search on single core model
[Diagram: Hyperparameters -> RDD -> .map() -> Executors, each running several single-core models -> Dataframe]
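The grid search pattern on this slide can be sketched as follows: enumerate every hyperparameter combination, then map a single-core training function over the combinations. The parameter names echo those logged with `log_param` later in the talk, but the values and the `train_and_score` stand-in are hypothetical. The sketch runs locally with the built-in `map`; on a cluster, assuming a `SparkContext` named `sc`, the same function could be applied with `sc.parallelize(grid, len(grid)).map(train_and_score).collect()`.

```python
from itertools import product

# Hypothetical search space; values are illustrative only.
space = {
    "batch": [32, 64],
    "vector_size": [100, 300],
    "window_size": [5, 10],
    "cost": [1.0, 10.0],
    "gamma": [0.01, 0.1],
}

def build_grid(space):
    """Expand the space into one dict per hyperparameter combination."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]

def train_and_score(params):
    """Stand-in for training one single-core model; returns (params, score)."""
    score = 1.0 / (1.0 + abs(params["gamma"] - 0.1) + abs(params["cost"] - 10.0))
    return params, score

def grid_search(grid, mapper=map):
    """Evaluate every combination and keep the best-scoring one."""
    results = list(mapper(train_and_score, grid))
    return max(results, key=lambda r: r[1])

grid = build_grid(space)
best_params, best_score = grid_search(grid)
```

Because each combination is independent, the number of runs is simply the product of the list lengths, which is why grid search grows quickly as parameters are added.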

How does MLflow manage multithreading in Spark?
• Created a class instead of using global variables
• Implemented __enter__, __exit__, __del__
methods to keep the same with-statement notation
class MtMlFlow:
    # Signatures only; the method bodies are not shown on the slide.
    def __init__(self, run_id=None, experiment_name=None,
                 experiment_id=None, mlflow_run: Run = None):
    @staticmethod
    def get_experiment_id(experiment_name):
    def __enter__(self):
    def __exit__(self, exc_type, exc_val, exc_tb):
    def __del__(self):
    @property
    def run(self):
    @property
    def run_id(self):
    def log_param(self, key, value):
    def set_tag(self, key, value):
    def log_metric(self, key, value, step=None):
    def log_metrics(self, metrics, step=None):
    def log_params(self, params):
    def set_tags(self, tags):
    def log_artifact(self, local_path, …):
    def log_artifacts(self, local_dir, …):
    @classmethod
    def create_experiment(cls, name, …):
    def get_artifact_uri(self, artifact_path=None):
with MtMlFlow(experiment_name='/Shared/my_exp_results') as experiment:
    experiment.log_param("batch", hyperparameters[0])
    experiment.log_param("vector_size", hyperparameters[1])
    experiment.log_param("window_size", hyperparameters[2])
    experiment.log_param("cost", hyperparameters[3])
    experiment.log_param("gamma", hyperparameters[4])
    ...
    experiment.log_artifact(file_path)
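A minimal sketch of the idea behind such a wrapper: keep the run in an object rather than in MLflow's module-level active run, so that several threads can log to separate runs. It relies on `MlflowClient` methods that take an explicit `run_id`. The class name `ExplicitRun` and its shape are assumptions for illustration, not the speakers' `MtMlFlow` implementation.

```python
class ExplicitRun:
    """Context manager that owns one MLflow run and logs via explicit run_id."""

    def __init__(self, experiment_id, client=None):
        if client is None:
            # Imported lazily so the sketch can be exercised with a stub client.
            from mlflow.tracking import MlflowClient
            client = MlflowClient()
        self._client = client
        self._experiment_id = experiment_id
        self.run_id = None

    def __enter__(self):
        # Each instance creates its own run; no global "active run" is touched.
        run = self._client.create_run(self._experiment_id)
        self.run_id = run.info.run_id
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        status = "FINISHED" if exc_type is None else "FAILED"
        self._client.set_terminated(self.run_id, status)
        return False  # never swallow exceptions

    def log_param(self, key, value):
        self._client.log_param(self.run_id, key, value)

    def log_metric(self, key, value, step=None):
        self._client.log_metric(self.run_id, key, value, step=step)
```

Passing the client in as an argument is a design choice of this sketch: it makes the class easy to test without a tracking server.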

Bayesian optimization for hyperparameter optimisation
[Diagram elements: Notebooks; Hyperparameters definition; Hyperopt; SparkTrials; Executors, each training a Deep Learning Model; > 500 runs; Optimal hyperparameter combination]
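The set-up on this slide corresponds to Hyperopt's `fmin` driven by the TPE algorithm, with `SparkTrials` distributing trials across executors. The sketch below is a hedged illustration only: the search space, the toy `objective` and the `parallelism` value are assumptions, and calling `run_search` requires `hyperopt` and an active Spark session.

```python
def objective(params):
    """Toy loss standing in for training a deep learning model (hypothetical)."""
    loss = (params["learning_rate"] - 0.01) ** 2 + 0.001 * params["batch"]
    return {"loss": loss, "status": "ok"}  # "ok" is the value of hyperopt.STATUS_OK

def run_search(max_evals=500, parallelism=10):
    """Run a TPE search with trials distributed over Spark executors."""
    # Imported here so that defining the sketch does not require hyperopt.
    from hyperopt import SparkTrials, fmin, hp, tpe

    space = {
        # loguniform bounds are given in log space: exp(-7) up to exp(0) = 1
        "learning_rate": hp.loguniform("learning_rate", -7, 0),
        "batch": hp.choice("batch", [16, 32, 64]),
    }
    trials = SparkTrials(parallelism=parallelism)
    return fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=max_evals, trials=trials)

# The objective can be checked on its own, without Spark.
example = objective({"learning_rate": 0.01, "batch": 32})
```

`fmin` returns the best point found; for `hp.choice` parameters it reports the index of the chosen option rather than the value itself.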

Hyperparameter optimisation
Method                       | Number of documents | Number of executors | Number of CPU per Executor | Time per run | Elapsed Time | Total CPU time | Number of runs
Grid search                  | 10000               | 10                  | 32                         | 75m          | 8h           | 100 days       | 1920
Bayesian optimisation (TPE*) | 10000               | 10                  | 2 GPU                      | 25m          | 21h          | 9 days         | 500
* TPE = Tree of Parzen Estimators
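The total compute time reported for each method is consistent with the number of runs multiplied by the time per run. A quick check of that arithmetic, using only the figures reported above:

```python
MINUTES_PER_DAY = 60 * 24

def total_days(runs, minutes_per_run):
    """Total compute time in days if every run takes minutes_per_run."""
    return runs * minutes_per_run / MINUTES_PER_DAY

grid_days = total_days(1920, 75)   # 100.0 days
bayes_days = total_days(500, 25)   # about 8.7 days, reported as 9
print(f"grid search: {grid_days:.1f} days, Bayesian: {bayes_days:.1f} days")
```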

What results?
• The best model achieves an F1 score of 98.5%
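For reference, F1 is the harmonic mean of precision and recall. The counts below are made up purely to show the computation; the slides do not state how the score was averaged across document types.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for a single document type.
score = f1_score(tp=90, fp=10, fn=30)
```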


Run_parallel implementation
