parsing - How to load an XML file with repeating tags and attributes in Spark 2.1?


I have an XML file with the following structure:

    <?xml version="1.0"?>
    <catalog>
      <new>
        <book id="bk101" language="en">
          <author id="4452" primary="true">gambardella, matthew</author>
          <title primary="true">xml developer's guide</title>
          <genre primary="false">computer</genre>
          <publisher primary="true" id="us124">
            <firm id="4124">amazon llc</firm>
            <address>ny, usa</address>
            <email type="official">books@amazon.com</email>
            <contact_person id="3351">
              <name>rajesh k.</name>
              <email type="personal">rajesh@amazon.com</email>
            </contact_person>
          </publisher>
        </book>
        <book id="bk103" language="en">
          <author id="4452" primary="true">corets, eva</author>
          <title primary="true">maeve ascendant</title>
          <genre primary="false">fantasy</genre>
          <publisher primary="true" id="us136">
            <firm id="4524">oreally llc</firm>
            <address>ny, usa</address>
            <email type="official">books@oreally.com</email>
            <contact_person id="1573">
              <name>prajakta g.</name>
              <email type="personal">prajakta@oreally.com</email>
            </contact_person>
          </publisher>
        </book>
      </new>
      <removed>
        <book id="bk104" language="en">
          <author id="4452" primary="true">corets, eva</author>
          <title primary="true">oberon's legacy</title>
          <genre primary="false">fantasy</genre>
          <publisher primary="true" id="us137">
            <firm id="4524">oreally llc</firm>
            <address>ny, usa</address>
            <email type="official">books@oreally.com</email>
            <contact_person id="1573">
              <name>prajakta g.</name>
              <email type="personal">prajakta@oreally.com</email>
            </contact_person>
          </publisher>
        </book>
      </removed>
    </catalog>

How do I load this into a Dataset? I tried to follow the example from Databricks (spark-xml), but received this error:

    AnalysisException: Reference '_id' is ambiguous, could be: _id#1, _id#3
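For reference, the load attempt looks roughly like this (following the spark-xml example; the file path and SparkSession setup are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("xml-books")
      .master("local[*]")
      .getOrCreate()

    // rowTag tells spark-xml which XML element becomes one row of the result
    val books = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("books.xml")   // placeholder path

    books.printSchema()
    books.show()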

I then replaced the StructField '_id' in the StructType schema with '_id#1', '_id#2', and so on (a rough schema sketch is below the stack trace),

but then received this error:

    Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
        at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1094)
        at com.databricks.spark.xml.util.XmlFile$.withCharset(XmlFile.scala:46)
        at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:62)
        at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:62)
        at com.databricks.spark.xml.XmlRelation.buildScan(XmlRelation.scala:54)
        at com.databricks.spark.xml.XmlRelation.buildScan(XmlRelation.scala:63)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$12.apply(DataSourceStrategy.scala:343)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$12.apply(DataSourceStrategy.scala:343)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:384)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:383)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:464)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:379)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:339)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
        at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
        at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
        at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2791)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
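For context, here is a rough sketch of the kind of explicit schema involved; the exact fields and types are illustrative, but it shows why '_id' appears more than once: spark-xml maps XML attributes to fields prefixed with "_" by default, and element text to "_VALUE".

    import org.apache.spark.sql.types._

    // Both <book id=...> and the nested <author id=...> surface as a field named "_id"
    val bookSchema = StructType(Seq(
      StructField("_id", StringType),        // id attribute of <book>
      StructField("_language", StringType),
      StructField("author", StructType(Seq(
        StructField("_VALUE", StringType),   // element text
        StructField("_id", LongType),
        StructField("_primary", BooleanType)
      ))),
      StructField("title", StructType(Seq(
        StructField("_VALUE", StringType),
        StructField("_primary", BooleanType)
      )))
      // genre, publisher, ... omitted for brevity
    ))

Note that the '#1', '#3' suffixes in the AnalysisException are Spark's internal expression IDs, not part of the actual column names.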

I found a solution that fixed the second error:

Add the older Jackson version to the pom.xml: jackson-core 2.6.7 and jackson-databind 2.6.7.
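In pom.xml terms the fix would look roughly like this (standard Jackson coordinates; pinning the older version presumably avoids a Jackson version conflict on the classpath):

    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-core</artifactId>
        <version>2.6.7</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.6.7</version>
    </dependency>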

