Grouping text data by fields (group by)

Requirement:

The text data format is as follows:

akc190|id_drg|name_drg|pdxCode|pdxName|sdxCodes|sdxNames|yka055

0001369157|101|seizure (-)|G40.901|epilepsy|G40.901$|epilepsy$|1946.56
0001370448|101|seizure (-)|G40.901|epilepsy|G40.901$J40.x00$|epilepsy$bronchitis$|2842.77
0001374918|101|seizure (-)|R56.001|febrile convulsion|R56.001$J03.900$|febrile convulsion$acute tonsillitis$|1813.14
0001358030|101|seizure (-)|R56.001|febrile convulsion|R56.001$J03.900$|febrile convulsion$acute tonsillitis$|2209.05
0001368014|101|seizure (-)|G41.900|status epilepticus|G41.900$|status epilepticus$|2986.82
0001384553|101|seizure (-)|G40.103|symptomatic focal epilepsy|G40.103$|symptomatic focal epilepsy$|1944.66
0001383190|101|seizure (-)|R56.001|febrile convulsion|R56.001$J06.900$|febrile convulsion$acute upper respiratory infection$|2532.59

The records need to be grouped by id_drg, pdxCode and sdxNames. For each group, count the medical records and collect the expense data (yka055) of every record in it.
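As a minimal sketch of the grouping key this requirement implies (not the author's implementation): the hypothetical helper `groupKey` below takes fields 1 (id_drg) and 3 (pdxCode) as they are and splits field 6 (sdxNames) on `$` into a set, so two records match whenever they carry the same secondary diagnoses. The object name, the helper name and the reordered second sample line are assumptions made for illustration.

```scala
object GroupKeySketch {
  // Hypothetical helper: builds the grouping key for one record line.
  // Field indexes: 1 = id_drg, 3 = pdxCode, 6 = sdxNames ('$'-joined list).
  def groupKey(line: String): (String, String, Set[String]) = {
    val arr = line.split("\\|").map(_.trim)
    // Split the joined names and drop the empty piece left by a stray '$'.
    val sdxNames = arr(6).split("\\$").map(_.trim).filter(_.nonEmpty).toSet
    (arr(1), arr(3), sdxNames)
  }

  def main(args: Array[String]): Unit = {
    val a = "0001374918|101|seizure (-)|R56.001|febrile convulsion|R56.001$J03.900$|febrile convulsion$acute tonsillitis$|1813.14"
    // Illustrative variant of the second record with the names in reverse order.
    val b = "0001358030|101|seizure (-)|R56.001|febrile convulsion|R56.001$J03.900$|acute tonsillitis$febrile convulsion$|2209.05"
    // The set-based key ignores the order of the secondary diagnoses.
    println(groupKey(a) == groupKey(b))
  }
}
```

Using a set for the multi-valued field is a design choice: it treats the order of secondary diagnoses as irrelevant, which is also how the program below compares groups.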

The code is as follows:

package com.cetc.chinadrgs.auto

import scala.collection.mutable.ListBuffer
import scala.io.Source

/**
  * Created by Shea on 2018/11/24.
  */
object GroupBy2Text extends App{
  val path="C:\\Users\\Shea\\Desktop\\drgs.txt"
  val encode="utf-8"
  val res=getCategoryAll(path,encode,"\\|","1,3")("6","\\$")
  getGroupRes(res,path,encode,"\\|","1,3")("6","\\$")("7","4")
  /**
    * Collect all distinct groups.
    * @param path       file path
    * @param encode     file encoding
    * @param delimiter  regex separating the fields of a line
    * @param indexs     comma-separated indexes of the plain grouping fields
    * @param splitFiled index of a grouping field that holds several joined values and must be split into a set ("" if there is none)
    * @param splitFlag  regex separating the joined values of splitFiled
    */
  def getCategoryAll(path:String,encode:String,delimiter:String,indexs:String)(splitFiled:String="",splitFlag:String): List[Set[String]] ={
    val groupList=new ListBuffer[Set[String]]
    val fields=indexs.split(",")
    val source=Source.fromFile(path,encode)
    try {
      source.getLines().foreach{line=>
        val arr=line.split(delimiter)
        //Split the multi-valued field into a set; match on splitFiled (not delimiter) so "" means no such field
        val specialField:Set[String]=splitFiled match {
          case ""=>Set("")
          case _=>arr(splitFiled.toInt).split(splitFlag).toSet
        }
        //Join the plain grouping fields with the control character \u0001
        val common:String=fields.map(index=>arr(index.toInt)).mkString("\u0001")
        //A group is the set of split values plus the joined plain fields
        val groupContent=specialField ++ Set(common)
        groupList.append(groupContent)
      }
    } finally source.close()
    //Keep each group only once
    groupList.distinct.toList
  }

  /**
    * Print the result for each group.
    * @param groupRes   distinct groups returned by getCategoryAll
    * @param path       file path
    * @param encode     file encoding
    * @param delimiter  regex separating the fields of a line
    * @param indexs     comma-separated indexes of the plain grouping fields
    * @param splitFiled index of the multi-valued grouping field ("" if there is none)
    * @param splitFlag  regex separating the joined values of splitFiled
    * @param parms      indexes of the other fields to output for every record of a group
    */
  def getGroupRes(groupRes:List[Set[String]],path:String,encode:String,delimiter:String,indexs:String)(splitFiled:String="",splitFlag:String)(parms:String*): Unit ={
    val fields=indexs.split(",")
    for (str<-groupRes){
      var count=0//Number of records in the group
      var groupFileds=""//Plain grouping fields of the group
      //Collects the extra output fields of every record; use one buffer per field to keep them apart
      val buffer = new StringBuilder()

      //Note: the file is read again for every group
      val source=Source.fromFile(path,encode)
      try {
        source.getLines().foreach{line=>
          val arr=line.split(delimiter)
          //Match on splitFiled (not delimiter) so "" means there is no multi-valued field
          val specialField:Set[String]=splitFiled match {
            case ""=>Set("")
            case _=>arr(splitFiled.toInt).split(splitFlag).toSet
          }

          //Join the plain grouping fields with the control character \u0001
          val common:String=fields.map(index=>arr(index.toInt)).mkString("\u0001")
          val groupContent=specialField ++ Set(common)
          if(str==groupContent){
            count+=1
            groupFileds=common
            val cont=parms.map(param=>arr(param.toInt)).mkString(",").trim
            buffer.append(cont+"@")
          }
        }
      } finally source.close()
      //Values of the multi-valued grouping field (everything except the \u0001-joined entry)
      val specialField=str.filterNot(_.contains("\u0001")).mkString(",")
      //Output the final result
      println(s"${groupFileds}|${specialField}|${count}|${buffer.toString}")
    }
  }
}

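The final results are as follows. Each line has the form grouping fields|sdxNames|count|yka055,pdxName@; id_drg and pdxCode are joined by the non-printing \001 character, so they appear run together:

id_drgpdxCode|sdxNames|count|yka055,pdxName@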

101G40.901|epilepsy|1|1946.56,epilepsy@
101G40.901|epilepsy,bronchitis|1|2842.77,epilepsy@
101R56.001|febrile convulsion,acute tonsillitis|2|1813.14,febrile convulsion@2209.05,febrile convulsion@
101G41.900|status epilepticus|1|2986.82,status epilepticus@
101G40.103|symptomatic focal epilepsy|1|1944.66,symptomatic focal epilepsy@
101R56.001|febrile convulsion,acute upper respiratory infection|1|2532.59,febrile convulsion@

Analysis:

Under the same DRG (101), the records fall into 6 combinations of principal diagnosis + secondary diagnoses. Only one combination contains more than one record:

The principal diagnosis is R56.001 -> febrile convulsion;

The secondary diagnoses are febrile convulsion + acute tonsillitis;

This combination contains two cases, costing 1813.14 and 2209.05 respectively.
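The program above reads the file once to collect the groups and then once more for every group. As a hedged alternative, the same grouping can be sketched in a single pass with the standard `groupBy` method on collections. This is not the author's method: `Rec`, `parse` and `summarize` are hypothetical names, and the output format is simplified (a comma instead of \001 between id_drg and pdxCode, sorted secondary diagnoses, no trailing @).

```scala
object GroupBySketch {
  // Hypothetical record type; field names follow the header line of the data.
  final case class Rec(idDrg: String, pdxCode: String, pdxName: String,
                       sdxNames: Set[String], yka055: String)

  // Parses one '|'-separated line; sdxNames is split on '$' into a set.
  def parse(line: String): Rec = {
    val arr = line.split("\\|").map(_.trim)
    val sdx = arr(6).split("\\$").map(_.trim).filter(_.nonEmpty).toSet
    Rec(arr(1), arr(3), arr(4), sdx, arr(7))
  }

  // Groups all records in one pass and formats one output line per group.
  // groupBy returns a Map, so the order of the groups is not guaranteed.
  def summarize(lines: Seq[String]): Seq[String] =
    lines.map(parse)
      .groupBy(r => (r.idDrg, r.pdxCode, r.sdxNames))
      .toSeq
      .map { case ((drg, pdx, sdx), recs) =>
        val details = recs.map(r => s"${r.yka055},${r.pdxName}").mkString("@")
        s"${drg},${pdx}|${sdx.toSeq.sorted.mkString(",")}|${recs.size}|${details}"
      }

  def main(args: Array[String]): Unit = {
    // Three of the sample records from the data above.
    val sample = Seq(
      "0001369157|101|seizure (-)|G40.901|epilepsy|G40.901$|epilepsy$|1946.56",
      "0001374918|101|seizure (-)|R56.001|febrile convulsion|R56.001$J03.900$|febrile convulsion$acute tonsillitis$|1813.14",
      "0001358030|101|seizure (-)|R56.001|febrile convulsion|R56.001$J03.900$|febrile convulsion$acute tonsillitis$|2209.05"
    )
    summarize(sample).foreach(println)
  }
}
```

This sketch keeps all parsed records in memory, which is reasonable for small files; for input that does not fit in memory, the per-group counts and buffers would have to be accumulated in a mutable map while streaming the lines.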

Keywords: Big Data Scala

Added by CroNiX on Sun, 08 Dec 2019 01:39:30 +0200