Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Clifford Chance

From zero to data science in a legal firm: how one of the world’s largest law firms is reshaping operations with advanced analytics. Clifford Chance LLP is one of the ten largest law firms in the world. With thousands of global clients, their teams handle millions of legal documents every year.

The data science team will share their approach to building an agile data science lab from zero on top of Apache Spark, Azure Databricks and MLflow. They will dive deep into how they used deep learning for natural language processing to classify large documents, using MLflow and Hyperopt for model comparison and hyperparameter optimization.

  1. 1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. 2. Mirko Bernardoni & Michael Seddon Clifford Chance Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks #UnifiedDataAnalytics #SparkAISummit
  3. 3. About Mirko Bernardoni. Mirko’s main role is to build and lead the data science lab. He strongly believes that the work Clifford Chance is doing in the research department offers a unique opportunity to shape and change the legal sector, which is why he also works closely with universities on research. MIRKO BERNARDONI, Head of Data Science, Clifford Chance. https://www.linkedin.com/i/mirko-bernardoni
  4. 4. About Michael Seddon. Michael’s focus is on solving Clifford Chance’s data engineering challenges. As a core member of the team, he has been involved in the design and development of the data pipelines crucial to the lab’s success, as well as delivering numerous machine learning projects helping to reshape Clifford Chance’s operational insights and execution. MICHAEL SEDDON, Senior Machine Learning Engineer, Clifford Chance. https://www.linkedin.com/in/mrmichaelseddon/
  5. 5. Clifford Chance. We are one of the world’s pre-eminent law firms, with significant depth and range of resources across five continents: trusted legal advice for the world’s leading businesses of today and tomorrow. Our international network: as a global law firm we are able to support clients at both a local and international level across Europe, Asia Pacific, the Americas, the Middle East and Africa. Our global, cross-discipline teams advise on a full range of legal solutions. We have a global view and, through our sector approach, a detailed understanding of our clients’ business, its drivers and competitive landscapes. Responsible Business: our Responsible Business strategy is integral to our firm strategy. It guides how we conduct our core business, how we develop and support our people, and how we foster closer collaboration with our clients. • Doing business: we promote market-shaping practices in relation to ethics, professional standards and risk management • People: we realise the potential of our people by creating a safe, healthy and inclusive workplace, and broadening our skills and experience • Community: we partner to support our community by widening access to justice, finance and education • Environment: we manage our environmental footprint and contribute to developing a more sustainable world. Delivering value to our clients: at Clifford Chance, we are committed to delivering a world-class service, providing the highest quality advice and support efficiently and effectively, every time. Our clients, who include corporate companies across all commercial and industrial sectors, governments, regulators, trade bodies and not-for-profit organisations, are at the heart of how we work. Understanding what our clients value and aligning with their needs underpins our approach. We invest heavily to ensure that clients benefit from our formidable knowledge and market insights, that they have access to the best team for the job, and that we bring the right processes and advanced technologies to bear on each matter. Our structure: as a single, fully integrated, global partnership, we pride ourselves on our approachable, collegiate and team-based way of working. We are a single profit pool, lockstep partnership. Our ambition is to work collaboratively across geographies, practices, product areas and sectors to deliver the best advice and support to our clients.
  6. 6. Agenda: Data Science in Law; Architecture; First success: Graph Analysis; NLP deep learning in Spark & Databricks.
  7. 7. Data science in Law Why would a law firm get involved in creating a data science lab? 7
  8. 8. Digital is changing how business gets done. Current law firm + digital technologies = future law firm: • Agile platforms and solutions designed to change and adapt • Human augmentation (AI) • Draw better insight out of data and convert intelligence into action • Simplify and digitise work execution. Digital trends: • Re-envision existing and enable new business models • Embrace different ways of bringing people together • Develop new capabilities that help organizations transform themselves into digital organizations. New regulations, markets: • Staying ahead by anticipating what’s next • Industry competitiveness (panels, benchmarking)
  9. 9. What we do in the Data Science Lab: we seek to better understand our data, diagnose and predict business outcomes, and recommend what to do to achieve our goals. 360 information products: provide self-service 360 views of critical data, available to all, to research particular topics further and enhance the existing business intelligence reporting. Operational insights, predictions & prescriptions: provide cross-cutting operational insights, anticipate what will happen and recommend what to do to achieve goals. For example: • what are the patterns in our profitability and what actions can we take? • can we better predict our fees? Turn our data into new client products & services: seek to create new products and services based on our data and the insights we can derive from it. • Up to your imagination!
  10. 10. AI adoption maturity curve, from research projects to client solutions (AI capability vs AI maturity; what are you trying to do?). BI & Apps: data-driven business analytics & reporting (Azure Data Services, SQL Server, Power Platform: Power BI, Power Apps, Power Flow). SaaS with AI: immediate actionable insights with AI, already productised, for example for due diligence, risk management and contract automation (also with Office 365, Workplace Analytics). Ad hoc AI: AI research, big data insight, AI idea validations (Azure cloud services, infrastructure as code); new opportunities, speed to market. AI “accelerators”: solution-specific AI services & patterns (Azure Cognitive Services, Azure Bot Services, Azure Search, industry accelerators). AI products: operationalise AI, product realisation, data science & deep AI capabilities (open frameworks, Azure Databricks, Azure Machine Learning, Azure AI Infrastructure, legal specific). Other themes: addressing cloud challenges and the innovation pipeline.
  11. 11. DS-Lab end-to-end capability: technology pioneers, evangelization, DevOps / SecOps / operationalization, idea & business process management. Data science: research, academic collaborations, modelling (e.g. ML and DL). Data pipeline: data extraction / collection, transformation, compliance, confidentiality. Production: minimum viable product, AI productionisation, product development.
  12. 12. Architecture The Legal Industry deals with confidential matters… is the data science lab system architecture appropriate? 12
  13. 13. Confidentiality: user access, time validity, contract limitations, client contracts, audits!
  14. Many challenges in legal: volume, variety, velocity. Architecture diagram: data from line-of-business systems (on-premise, Blob Storage, Data Lake) flows through data engineering and data processing (batch and stream) into data science datasets (Dataset 1, 2 and 3), which feed ML+AI workloads such as community and topic detection, relationship analysis and decision automation, serving customers.
  15. 15. Architecture 15
  16. 16. First success: Graph analysis. What was the data science lab’s spark?
  17. 17. Client relationship analysis. Understanding client relationships: a work in progress. Understanding customer relationships is crucial to our business and revenue generation. We have a constant battle getting people to use our CRM (customer relationship management); it is manually intensive, and people’s time is pressured. The goal: use our existing data to uncover insights into our customer and client relationships.
  18. 18. Client relationship analysis pipeline. Define datasets: email system, HR system, CRM system, others. Academic literature & research: knowledge graph, graph algorithms. Demo realization: dataset analysis, research implementation, user visualisation. Production: minimum viable product, productionise the research. Key performance indicator: agile project, confidential.
  19. 19. 19 Client relationship analysis
  20. 20. NLP deep learning in Spark & Databricks Let’s get serious 20
  21. 21. Document classification. LIBOR, Brexit and similar exercises deal with a huge number of documents. Document analysis requires identifying the document type. The ability to recognise document types is a requirement for many matters. We often have urgent use cases from client tenders.
  22. 22. Document classification pipeline. Define datasets: EDGAR. Academic literature & research: text classification, document classification, deep learning for NLP. Demo realization: work in progress. Production: minimum viable product, productionise the research. Key performance indicator: agile project, confidential.
  23. 23. Document example • Document size 300 pages or more (more than 150,000 words)
  24. 24. Open EDGAR: data download and cleaning. Documents from the U.S. SEC are collected with Open EDGAR, then go through text extraction, document cleaning and document analysis, with documents and metadata stored in Blob Storage. Volumes: 15M raw documents, 7M cleaned documents, 250K in the final state-of-the-art set.
  25. 25. Models development • Cross-validation with 10 datasets of 1,000 documents • Considered the following models/architectures: CNN, LSTM, Doc2Vec, SVM, BERT (a sketch of how such a comparison could be logged appears after the transcript).
  26. 26. Hyperparameter optimisation • How can we choose the optimal hyperparameters? – Grid search – Bayesian optimization 26
  27. 27. Grid search on single-core models: hyperparameter combinations held in a DataFrame/RDD are fanned out with .map(), so each executor core trains one single-core model per combination (a minimal sketch of this pattern appears after the transcript).
  28. 28. MLflow – single core
  29. 29. 29
  30. 30. How does MLflow manage multithreading in Spark? • Created a wrapper class instead of using global variables • Implemented the __enter__, __exit__ and __del__ methods so the same with-statement notation can be used (a sketch of one possible implementation appears after the transcript).
  31. 31. How does mlflow manage multithreading in Spark? class MtMlFlow: def __init__(self, run_id=None, experiment_name=None, experiment_id=None, mlflow_run: Run = None): @staticmethod def get_experiment_id(experiment_name): def __enter__(self): def __exit__(self, exc_type, exc_val, exc_tb): def __del__(self): @property def run(self): @property def run_id(self): def log_param(self, key, value): def set_tag(self, key, value): def log_metric(self, key, value, step=None): def log_metrics(self, metrics, step=None): def log_params(self, params): def set_tags(self, tags): def log_artifact(self, local_path, …): def log_artifacts(self, local_dir, …): @classmethod def create_experiment(cls, name, …): def get_artifact_uri(self, artifact_path=None): 31 with MtMlFlow(experiment_name='/Shared/my_exp_results') as experiment: experiment.log_param("batch", hyperparameters[0]) experiment.log_param("vector_size", hyperparameters[1]) experiment.log_param("window_size", hyperparameters[2]) experiment.log_param("cost", hyperparameters[3]) experiment.log_param("gamma", hyperparameters[4]) ... experiment.log_artifact(file_path)
  32. 32. Bayesian optimization for hyperparameter optimisation: Hyperopt runs from the notebook with the hyperparameter definitions, SparkTrials schedules more than 500 runs of the deep learning model across the executors, and the optimal hyperparameter combination is returned (a sketch of this setup appears after the transcript).
  33. 33. Hyperopt example 33
  34. 34. Hyperparameters results 34
  35. 35. Hyperparameter optimisation comparison (both approaches: 10,000 documents, 10 executors). Grid search: 32 CPUs per executor, 75 minutes per run, 1,920 runs, 8 hours elapsed, 100 days of total CPU time. Bayesian optimisation (TPE, Tree of Parzen Estimators): 2 GPUs per executor, 25 minutes per run, 500 runs, 21 hours elapsed, 9 days of total CPU time.
  36. 36. What results? • The best model achieves an F1 score of 98.5%.
  37. 37. Q&A
  38. 38. Thank you!
  39. 39. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT
  40. 40. 40
  41. 41. Run_parallel implementation 41
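
The sketches below expand on a few of the steps in the deck. First, slide 25’s model comparison: a minimal sketch of scoring candidate classifiers with 10-fold cross-validation and logging macro F1 to MLflow for comparison. The load_documents() helper, the TF-IDF features and the two baseline models are illustrative assumptions, not the team’s actual pipeline.

# Minimal sketch, not the original pipeline: compare candidate classifiers
# with 10-fold cross-validated macro F1 and log the results to MLflow.
# load_documents() is a hypothetical loader returning texts and labels.
import mlflow
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts, labels = load_documents()  # hypothetical: document strings and class labels

candidates = {
    "svm": make_pipeline(TfidfVectorizer(max_features=50_000), LinearSVC()),
    "logreg": make_pipeline(TfidfVectorizer(max_features=50_000),
                            LogisticRegression(max_iter=1000)),
}

for name, model in candidates.items():
    with mlflow.start_run(run_name=name):
        scores = cross_val_score(model, texts, labels, cv=10, scoring="f1_macro")
        mlflow.log_param("model", name)
        mlflow.log_metric("f1_macro_mean", scores.mean())
        mlflow.log_metric("f1_macro_std", scores.std())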
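
Slides 27 and 41 describe distributing an exhaustive grid search by mapping hyperparameter combinations over an RDD, so that each Spark task trains one single-core model. A minimal sketch of that pattern follows; the parameter grid and the train_and_score() helper are illustrative assumptions rather than the original run_parallel code.

# Minimal sketch, not the original run_parallel implementation: fan a grid of
# hyperparameter combinations out over an RDD so each task trains one
# single-core model. train_and_score() is a hypothetical helper.
from itertools import product
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

grid = list(product(
    [16, 32, 64],      # batch size
    [100, 200, 300],   # embedding vector size
    [5, 10],           # window size
))

def train_and_score(hp):
    """Hypothetical: train one single-core model for this combination on
    broadcast training data and return its validation F1 (placeholder here)."""
    batch_size, vector_size, window_size = hp
    f1 = 0.0  # replace with real training and evaluation
    return hp, f1

# One partition per combination, so every executor core handles one model.
results = (spark.sparkContext
           .parallelize(grid, numSlices=len(grid))
           .map(train_and_score)
           .collect())

best_hp, best_f1 = max(results, key=lambda r: r[1])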
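
Slide 31 shows only the interface of the MtMlFlow wrapper and how it is called. One way such a wrapper could be built, shown here as a minimal sketch rather than the team’s implementation, is to create (or attach to) a run explicitly and route every call through MlflowClient with that run’s ID, so each worker thread logs to its own run without touching MLflow’s global active-run state.

# Minimal sketch, not the original implementation: a context manager that logs
# to a specific MLflow run through MlflowClient, so each thread can hold its
# own run instead of relying on the global mlflow.active_run() state.
from mlflow.tracking import MlflowClient

class MtMlFlow:
    def __init__(self, experiment_name=None, run_id=None):
        self._client = MlflowClient()
        if run_id is not None:
            self._run = self._client.get_run(run_id)      # attach to an existing run
        else:
            exp_id = self.get_experiment_id(experiment_name)
            self._run = self._client.create_run(exp_id)   # start a new run

    def get_experiment_id(self, experiment_name):
        exp = self._client.get_experiment_by_name(experiment_name)
        return exp.experiment_id if exp else self._client.create_experiment(experiment_name)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Mark the run finished (or failed) when the with-block exits.
        self._client.set_terminated(self.run_id, status="FAILED" if exc_type else "FINISHED")

    @property
    def run_id(self):
        return self._run.info.run_id

    def log_param(self, key, value):
        self._client.log_param(self.run_id, key, value)

    def log_metric(self, key, value, step=None):
        self._client.log_metric(self.run_id, key, value, step=step)

    def log_artifact(self, local_path, artifact_path=None):
        self._client.log_artifact(self.run_id, local_path, artifact_path)

An instance built this way can be used with the with-statement shown on slide 31, and several threads can each hold their own MtMlFlow run at the same time.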
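
Finally, slides 32 and 33 swap the grid for Hyperopt’s Tree of Parzen Estimators, with SparkTrials distributing the trials across the cluster. A minimal sketch of that setup is below; the search space and the train_and_evaluate() objective are illustrative assumptions, not the original notebook.

# Minimal sketch, not the original notebook: Bayesian hyperparameter search with
# Hyperopt's TPE algorithm, with SparkTrials running the trials on executors.
# train_and_evaluate() and the search space are illustrative assumptions.
from hyperopt import SparkTrials, STATUS_OK, fmin, hp, tpe

search_space = {
    "batch_size": hp.choice("batch_size", [16, 32, 64]),
    "learning_rate": hp.loguniform("learning_rate", -9, -3),
    "dropout": hp.uniform("dropout", 0.1, 0.5),
}

def objective(params):
    """Hypothetical objective: train the document classifier with these
    hyperparameters and return the loss Hyperopt should minimise (1 - F1)."""
    f1 = train_and_evaluate(params)  # hypothetical training function
    return {"loss": 1.0 - f1, "status": STATUS_OK}

# Each trial runs as a Spark task; parallelism caps how many run at once.
trials = SparkTrials(parallelism=10)
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=500, trials=trials)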
